Abstract

In this paper, we consider the problem of preserving privacy in the online learning setting. Online learning involves learning from data in real time, so that the learned model and its outputs change continuously. This makes preserving the privacy of each data point significantly more challenging, since its effect on the learned model can easily be tracked through changes in the subsequent outputs. Furthermore, with more and more online systems (e.g., search engines like Bing and Google) trying to learn their customers' behavior by leveraging their access to sensitive customer data (through cookies etc.), the problem of privacy-preserving online learning has become critical as well.

We study the problem in the online convex programming (OCP) framework, a popular online learning setting with several interesting theoretical and practical implications, while using differential privacy as the formal privacy measure. For this problem, we distill two critical attributes that a private OCP algorithm should have in order to provide reasonable privacy as well as utility guarantees: 1) linearly decreasing sensitivity, i.e., as new data points arrive, their effect on the learning model decreases; 2) a sub-linear regret bound, regret being a popular goodness/utility measure of an online learning algorithm. Given an OCP algorithm that satisfies these two conditions, we provide a general framework to convert it into a privacy-preserving OCP algorithm with good (sub-linear) regret. We then illustrate our approach by converting two popular online learning algorithms into their differentially private variants while guaranteeing sub-linear regret ($O(\sqrt{T})$). Next, we consider the special case of online linear regression problems, a practically important class of online learning problems, for which we generalize an approach by [13] to provide a differentially private algorithm with just $O(\log^{1.5} T)$ regret. Finally, we show that our online learning framework can be used to provide differentially private algorithms for offline learning as well. For the offline learning problem, our approach obtains better error bounds and can handle a larger class of problems than the existing state-of-the-art methods [3].

1 Introduction

As computational resources are increasing rapidly, modern websites and online systems are able to process large amounts of information gathered from their customers in real time. While these websites typically intend to learn from the available data and improve their systems in real time, this also represents a severe threat to the privacy of customers.

For example, consider a generic scenario for a web search engine like Bing. Sponsored advertisements (ads) served with search results form a major source of revenue for Bing, so Bing needs to serve ads that are relevant to both the user and the query. As each user is different and can have a different definition of "relevance", many websites try to learn user behavior from past searches as well as other available demographic information. This learning problem has two key features: a) the advertisements are generated online in response to a query, and b) feedback on the quality of an ad for a user cannot be obtained until the ad is served. Hence, the problem is an online learning game in which the search engine tries to guess (from history and other available information) whether a user would like an ad and receives the cost/reward only after making that online decision; after receiving the feedback, the search engine can update its model. This problem can be cast as a standard online learning problem, and several existing algorithms can be used to solve it reasonably well.



'Part of the work was done while visiting Microsoft Research India. 
However, processing critical user information in real time also poses severe threats to a user's privacy. For example, suppose that Bing, in response to certain past queries (say, about a disease), promotes a particular ad that otherwise would not appear at the top, and the user clicks that ad. Then the corresponding advertiser may be able to guess the user's past queries, thus compromising privacy. Hence, it is critical for the search engine to use an algorithm that not only provides a correct guess about the relevance of an ad to a user, but also guarantees privacy to the user. Other examples where privacy-preserving online learning is critical include online portfolio management [24] and online linear prediction [20].

In this paper, we address privacy concerns for online learning scenarios similar to the ones mentioned above. Specifically, we provide a generic framework for privacy-preserving online learning. We use differential privacy [11] as the formal privacy notion, and online convex programming (OCP) [36] as the formal online learning model.

Differential privacy is a popular privacy notion with several interesting theoretical properties. Recently, there has been a lot of progress in differential privacy. However, most of the results assume that all of the data is available beforehand and that an algorithm processes this data to extract interesting information without compromising privacy. In contrast, in the online setting that we consider in this paper, data arrives online¹ (e.g., user queries and clicks) and the algorithm has to provide an output (e.g., relevant ads) at each step. Hence, the number of outputs produced is roughly the same as the size of the entire dataset. To guarantee differential privacy, one has to analyze the privacy of the complete sequence of outputs produced, which makes privacy preservation a significantly harder problem in this setting. In related work, [13] also considered the problem of differentially private online learning. Using the online experts model as the underlying online learning model, [13] provided an accurate differentially private algorithm to handle counting-type problems. However, the setting and the class of problems handled by [13] are restrictive, and it is not clear how their techniques can be extended to handle typical online learning scenarios, such as the one mentioned above. See Section 1.1 for a more detailed comparison to [13].

Online convex programming (OCP), which we use as our underlying online learning model, is an important and powerful online learning model with several theoretical and practical applications. OCP requires that the algorithm select an output at each step from a fixed convex set, for which the algorithm incurs a cost according to a convex function (that may be different at each step). The cost function is revealed only after the point is selected. The goal is to minimize the regret, i.e., the total "added" loss incurred in comparison to the optimal offline solution, a solution obtained after seeing all the cost functions. OCP encompasses various online learning paradigms and has several applications such as portfolio management [32]. Assuming that each of the cost functions is bounded over the fixed convex set, the regret incurred by any OCP algorithm can be trivially bounded by $O(T)$, where $T$ is the total number of time steps for which the algorithm is executed. However, several interesting algorithms have recently been developed that obtain regret that is sub-linear in $T$. That is, as $T \to \infty$, the average cost incurred per step approaches that of the optimal offline solution. In this paper, we use regret as a "goodness" or "utility" property of an algorithm and require that a reasonable OCP algorithm should at least have sub-linear regret.

To recall, we consider the problem of differentially private OCP, where we want to provide differential privacy guarantees along with a sub-linear regret bound. To this end, we provide a general framework to convert any online learning algorithm into a differentially private algorithm with sub-linear regret, provided that the algorithm satisfies two criteria: a) linearly decreasing sensitivity (see Definition 3), and b) sub-linear regret. We then analyze two popular OCP algorithms, namely Implicit Gradient Descent (IGD) [27] and Generalized Infinitesimal Gradient Ascent (GIGA) [36], to guarantee differential privacy as well as $O(\sqrt{T})$ regret for a fairly general class of strongly convex functions with Lipschitz continuous gradients. In fact, we show that IGD can be used with our framework even for non-differentiable functions. We then show that if the cost functions are quadratic (e.g., online linear regression), then we can use another OCP algorithm called Follow The Leader (FTL) [22] along with a generalization of a technique by [13] to guarantee $O(\ln^{1.5} T)$ regret while preserving privacy.

Furthermore, our differentially private online learning framework can be used to obtain privacy-preserving algorithms for a large class of offline learning problems [3] as well. In particular, we show that our private OCP framework can be used to obtain good generalization error bounds for various offline learning problems using techniques from [23] (see Section 4.2). Our differentially private offline learning framework can handle a larger class of learning problems with better error bounds than the existing state-of-the-art methods [3].

¹At each time step, one data entry arrives.
1.1 Related Work

As more and more of the world's information is digitized, privacy has become a critical issue. To this end, several ad-hoc privacy notions have been proposed; however, most of them now stand broken. De-anonymization of the Netflix challenge dataset by [31] and of the publicly released AOL search logs [1] are two examples that were instrumental in discarding these ad-hoc privacy notions. Even relatively sophisticated notions such as $k$-anonymity [34] and $\ell$-diversity [28] have been breached by attacks [16]. Hence, in pursuit of a theoretically sound notion of privacy, [11] proposed differential privacy, a cryptography-inspired definition of privacy. This notion has now been accepted as the standard privacy notion, and in this work we adhere to it for our privacy guarantees.

Over the years, the privacy community has developed differentially private algorithms for several interesting problems [6, 7, 8]. In particular, there exist many results concerning privacy for learning problems [2, 3, 35, 29, 33]. Among these, [3] is of particular interest, as it considers a large class of learning problems that can be written as (offline) convex programs. Interestingly, our techniques can be used to handle the offline setting of [3] as well; in fact, our method can handle a larger class of learning problems with better error bounds (see Section 4.2).

As mentioned earlier, most of the existing work in differentially private learning has been in the offline setting, where the complete dataset is provided upfront. One notable exception is the work of [13], where the authors formally defined the notion of differentially private learning when the data arrives online. Specifically, [13] defined two notions of differential privacy, namely user level privacy and event level privacy. Roughly speaking, user level privacy guarantees are at the granularity of each user whose data is present in the dataset. In contrast, event level privacy provides guarantees at the granularity of individual records in the dataset. It has been shown in [13] that it is impossible to obtain any non-trivial result with respect to user level privacy. In our current work we use the notion of event level privacy. [13] also looked at a particular online learning setting called the experts setting, where their algorithm achieves a regret bound of $O(\ln T)$ for counting problems while guaranteeing event level differential privacy. However, their approach is restricted to the expert advice setting and cannot handle typical online learning problems that arise in practice. In contrast, we consider a significantly more practical and powerful class of online learning problems, namely online convex programming, and also provide a method for handling a large class of offline learning problems.

In a related line of work, there have been a few results that use online learning techniques to obtain differentially private algorithms [18, 14]. In particular, [18] used the experts framework to obtain a differentially private algorithm for answering adaptive counting queries on a dataset. However, we stress that although these methods use online learning techniques, they are designed to handle only the offline setting, where the dataset is fixed and known in advance.

Recall that in the online setting, whenever a new data entry is added to $D$, a query has to be answered, i.e., the total number of queries to be answered is of the order of the size of the dataset. In a line of work started by [5] and subsequently explored in detail by [12, 25], it was shown that if one answers $O(T)$ subset sum queries on a dataset $D \in \{0, 1\}^T$ with noise in each query smaller than $\sqrt{T}$, then using those answers alone one can reconstruct a large fraction of $D$. That is, when the number of queries is almost the same as the size of the dataset, a reasonably "large" amount of noise needs to be added to preserve privacy. Subsequently, there has been a lot of work on providing lower bounds (specific to differential privacy) on the amount of noise needed to guarantee privacy while answering a given number of queries (see [19, 25, 4]). We note that our generic online learning framework (see Section 3.1) also adds noise of the order of $T^{0.5+c}$, $c > 0$, at each step, thus respecting the established lower bounds. In contrast, our algorithm for quadratic loss functions (see Section 3.5) avoids this barrier by exploiting the special structure of the queries that need to be answered.

1.2 Our Contributions 

Following are the main contributions of this paper: 

1. We formalize the problem of privacy-preserving online learning using differential privacy as the privacy notion and Online Convex Programming (OCP) as the underlying online learning model. We provide a generic differentially private framework for OCP in Section 3 and provide privacy and utility (regret) guarantees.

2. We then show that using our generic framework, two popular OCP algorithms, namely Implicit Gradient Descent (IGD) [27] and Generalized Infinitesimal Gradient Ascent (GIGA) [36], can be easily transformed into private online learning algorithms with good regret bounds.

3. For a special class of OCP where the cost functions are quadratic, we show that we can improve the regret bound to $O(\ln^{1.5} T)$ by exploiting techniques from [13]. This special class includes a very important online learning problem, namely online linear regression.

4. In Section 4.2, we show that our differentially private framework for online learning can be used to solve a large class of offline learning problems as well (where the complete dataset is available at once) and provide tighter utility guarantees than the existing state-of-the-art results [3].

5. Finally, through empirical experiments on benchmark datasets, we demonstrate the practicality of our algorithms for the important problems of online linear regression and online logistic regression (see the experiments section).

2 Preliminaries 

2.1 Online Convex Programming 

Online convex programming (OCP) is one of the most popular and powerful paradigms in the online learning setting. OCP can be thought of as a game between a player and an adversary. At each step $t$, the player selects a point $\mathbf{x}_t \in \mathbb{R}^d$ from a convex set $C$. Then, the adversary selects a convex cost function $f_t : \mathbb{R}^d \to \mathbb{R}$ and the player has to pay a cost of $f_t(\mathbf{x}_t)$. Hence, an OCP algorithm $\mathcal{A}$ maps a function sequence $F = \langle f_1, f_2, \dots, f_T \rangle$ to a sequence of points $X = \langle \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T \rangle \in C^T$, i.e., $\mathcal{A}(F) = X$. The goal of the player (or the algorithm) is to minimize the total cost incurred over a fixed number (say $T$) of iterations. However, as the adversary selects the function $f_t$ after observing the player's move $\mathbf{x}_t$, it can make the total cost incurred by the player arbitrarily large. Hence, a more realistic goal for the player is to minimize regret, i.e., the total cost incurred when compared to the optimal offline solution $\mathbf{x}^*$ selected in hindsight, i.e., when all the functions have already been provided. Formally,

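Definition 1 (Regret). Let $\mathcal{A}$ be an online convex programming algorithm. Also, let $\mathcal{A}$ select a point $\mathbf{x}_t \in C$ at the $t$-th iteration and let $f_t : \mathbb{R}^d \to \mathbb{R}$ be the convex cost function served at the $t$-th iteration. Then, the regret $\mathcal{R}_{\mathcal{A}}$ of $\mathcal{A}$ over $T$ iterations is given by:

$$\mathcal{R}_{\mathcal{A}}(T) = \sum_{t=1}^{T} f_t(\mathbf{x}_t) - \min_{\mathbf{x}^* \in C} \sum_{t=1}^{T} f_t(\mathbf{x}^*).$$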

Assuming each $f_t$ to be a bounded function over $C$, any trivial algorithm $\mathcal{A}$ that selects a random point $\mathbf{x}_t \in C$ will have $O(T)$ regret. However, several results [27, 36] show that if each $f_t$ is a bounded Lipschitz function over $C$, $O(\sqrt{T})$ regret can be achieved. Furthermore, if each $f_t$ is a "strongly" convex function, $O(\ln T)$ regret can be achieved.
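
To make the regret definition concrete, the following minimal Python sketch computes the regret of a sequence of plays on a one-dimensional toy problem. The interval $C = [-1, 1]$, the quadratic costs $f_t(x) = (x - y_t)^2$, and the helper names `regret` and `follow_the_mean` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def regret(plays, targets, lo=-1.0, hi=1.0):
    """Regret of `plays` for the assumed costs f_t(x) = (x - y_t)^2 over C = [lo, hi]."""
    plays = np.asarray(plays, dtype=float)
    targets = np.asarray(targets, dtype=float)
    online_cost = np.sum((plays - targets) ** 2)
    # For these quadratic costs the offline optimum is the mean of the targets,
    # clipped to C (the summed cost is a one-dimensional convex quadratic).
    x_star = np.clip(targets.mean(), lo, hi)
    offline_cost = np.sum((x_star - targets) ** 2)
    return online_cost - offline_cost

def follow_the_mean(targets, lo=-1.0, hi=1.0):
    """Toy online player: play 0 first, then the clipped mean of the targets seen so far."""
    plays = [0.0]
    for t in range(1, len(targets)):
        plays.append(float(np.clip(np.mean(targets[:t]), lo, hi)))
    return plays

targets = np.random.default_rng(0).uniform(-1.0, 1.0, size=200)
print(regret(follow_the_mean(targets), targets))   # regret of the adaptive player
print(regret(np.ones_like(targets), targets))      # regret of always playing x = 1
```

In this toy setting the adaptive player only uses costs revealed in earlier rounds, which mirrors the order of play in the OCP game; the constant player illustrates the trivial linear-in-$T$ regret.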

2.2 Differential Privacy 

We now formally define the notion of differential privacy in the context of our problem. 

Definition 2 ($(\epsilon, \delta)$-differential privacy [11, 9]). Let $F = \langle f_1, f_2, \dots, f_T \rangle$ be a sequence of convex functions. Let $\mathcal{A}(F) = X$, where $X = \langle \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T \rangle \in C^T$ are the $T$ outputs of the OCP algorithm $\mathcal{A}$ when applied to $F$. Then, a randomized OCP algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private if, given any two function sequences $F$ and $F'$ that differ in at most one function entry, for all $S \subseteq C^T$ the following holds:

$$\Pr[\mathcal{A}(F) \in S] \le e^{\epsilon} \Pr[\mathcal{A}(F') \in S] + \delta.$$

Intuitively, the above definition means that changing an $f_\tau \in F$, $\tau \le T$, to some other function $f'_\tau$ will not modify the distribution of the output sequence $X$ by a large amount. If we consider each $f_\tau$ to be some information associated with an individual, then the above definition states that the presence or absence of that individual's entry in the dataset will not affect the output by too much. Hence, the output of the algorithm $\mathcal{A}$ will not reveal any extra information about the individual. The privacy parameters $(\epsilon, \delta)$ decide the extent to which an individual's entry affects the output; lower values of $\epsilon$ and $\delta$ mean a higher level of privacy. Typically, $\delta$ should be exponentially small in the problem parameters, i.e., in our case $\delta \approx \exp(-T)$.
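
As a small illustration of the inequality in the definition, the sketch below checks it exhaustively for randomized response on a single bit, a classical mechanism that is not part of this paper and is used here only because its output space is small enough to enumerate every set $S$. The function names and the truth probability 0.75 are assumptions made for the example.

```python
import math
from itertools import chain, combinations

def randomized_response(bit, p_truth=0.75):
    """Output distribution when `bit` is reported truthfully with probability p_truth."""
    return {bit: p_truth, 1 - bit: 1.0 - p_truth}

def satisfies_dp(dist_a, dist_b, eps, delta=0.0):
    """Check Pr_a[S] <= e^eps * Pr_b[S] + delta for every outcome set S, in both directions."""
    outcomes = sorted(set(dist_a) | set(dist_b))
    subsets = chain.from_iterable(
        combinations(outcomes, k) for k in range(len(outcomes) + 1))
    for s in subsets:
        pa = sum(dist_a.get(o, 0.0) for o in s)
        pb = sum(dist_b.get(o, 0.0) for o in s)
        bound = math.exp(eps)
        # A tiny slack guards against floating-point round-off at the boundary.
        if pa > bound * pb + delta + 1e-12 or pb > bound * pa + delta + 1e-12:
            return False
    return True

d0, d1 = randomized_response(0), randomized_response(1)
print(satisfies_dp(d0, d1, eps=math.log(3)))        # True
print(satisfies_dp(d0, d1, eps=0.5))                # False
print(satisfies_dp(d0, d1, eps=0.5, delta=0.35))    # True
```

The last call shows how a positive $\delta$ relaxes the requirement: the same mechanism fails the pure $\epsilon = 0.5$ test but passes once an additive slack is allowed.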

2.3 Notation 

$F = \langle f_1, f_2, \dots, f_T \rangle$ denotes the function sequence given to an OCP algorithm $\mathcal{A}$, and $\mathcal{A}(F) = X$ with $X = \langle \mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T \rangle \in C^T$ represents the output sequence when $\mathcal{A}$ is applied to $F$. We denote the subsequence of functions in $F$ up to the $t$-th step as $F_t = \langle f_1, \dots, f_t \rangle$. $d$ denotes the dimensionality of the ambient space of the convex set $C$. Vectors are denoted by bold-face symbols, and matrices are represented by capital letters. $\mathbf{x}^T \mathbf{y}$ denotes the inner product of $\mathbf{x}$ and $\mathbf{y}$. $\|M\|_2$ denotes the spectral norm of a matrix $M$; recall that for symmetric matrices $M$, $\|M\|_2$ is the largest absolute eigenvalue of $M$.

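Typically, $\alpha$ is the minimum strong convexity parameter of any $f_t \in F$. Similarly, $L$ and $L_G$ are the largest Lipschitz constant and the largest Lipschitz constant of the gradient of any $f_t \in F$. Recall that a function $f : C \to \mathbb{R}$ is $\alpha$-strongly convex if for all $\gamma \in (0, 1)$ and for all $\mathbf{x}, \mathbf{y} \in C$ the following holds: $f(\gamma \mathbf{x} + (1 - \gamma)\mathbf{y}) \le \gamma f(\mathbf{x}) + (1 - \gamma) f(\mathbf{y}) - \frac{\alpha \gamma (1 - \gamma)}{2}\|\mathbf{x} - \mathbf{y}\|_2^2$. Also recall that a function $f$ is $L$-Lipschitz if for all $\mathbf{x}, \mathbf{y} \in C$ the following holds: $|f(\mathbf{x}) - f(\mathbf{y})| \le L\|\mathbf{x} - \mathbf{y}\|_2$. A function $f$ has a Lipschitz continuous gradient if $\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\|_2 \le L_G \|\mathbf{x} - \mathbf{y}\|_2$ for all $\mathbf{x}, \mathbf{y} \in C$. The non-private and private versions of an OCP algorithm output $\mathbf{x}_{t+1}$ and $\hat{\mathbf{x}}_{t+1}$, respectively, at time step $t$. $\mathbf{x}^*$ denotes the optimal offline solution, that is, $\mathbf{x}^* = \arg\min_{\mathbf{x} \in C} \sum_{t=1}^{T} f_t(\mathbf{x})$. $\mathcal{R}_{\mathcal{A}}(T)$ denotes the regret of an OCP algorithm $\mathcal{A}$ when applied for $T$ steps.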
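
The strong convexity and gradient-Lipschitz parameters can be checked numerically on a simple example. The sketch below assumes a least-squares cost $f(\mathbf{x}) = \frac{1}{2}\|A\mathbf{x} - \mathbf{b}\|_2^2$ (an illustrative choice, with randomly drawn $A$ and $\mathbf{b}$), for which $\alpha$ and $L_G$ are the smallest and largest eigenvalues of $A^T A$, and uses the standard form of the strong convexity inequality with the $\gamma(1-\gamma)$ factor.

```python
import numpy as np

def strong_convexity_gap(f, x, y, gamma, alpha):
    """Right-hand side minus left-hand side of the alpha-strong-convexity inequality.

    The value is nonnegative whenever the inequality holds at (x, y, gamma).
    """
    lhs = f(gamma * x + (1 - gamma) * y)
    rhs = (gamma * f(x) + (1 - gamma) * f(y)
           - 0.5 * alpha * gamma * (1 - gamma) * np.sum((x - y) ** 2))
    return rhs - lhs

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))          # assumed design matrix, full column rank almost surely
b = rng.normal(size=5)

def f(x):
    """Least-squares cost used as the example function."""
    return 0.5 * np.sum((A @ x - b) ** 2)

eigenvalues = np.linalg.eigvalsh(A.T @ A)
alpha = eigenvalues.min()            # strong convexity parameter of f
L_G = eigenvalues.max()              # Lipschitz constant of the gradient of f

gaps = [strong_convexity_gap(f, rng.normal(size=3), rng.normal(size=3),
                             rng.uniform(), alpha)
        for _ in range(1000)]
print(min(gaps) >= -1e-9)            # inequality holds up to round-off
```

Replacing `alpha` by any value larger than `L_G` makes the gap negative for every pair of distinct points, which is one way to see that the parameter cannot be overstated.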

3 Differentially Private Online Convex Programming 

In Section 2.1, we defined the online convex programming (OCP) problem and presented a notion of utility (called regret) for OCP algorithms. Recall that a reasonable OCP algorithm should have sub-linear regret, i.e., the regret should be sub-linear in the number of time steps $T$.

In this section, we present a generic differentially private framework for solving OCP problems (see Algorithm 1). We further provide formal privacy and utility guarantees for this framework (see Theorems 1 and 2). We then use our private OCP framework to convert two existing OCP algorithms, namely Implicit Gradient Descent (IGD) [27] and Generalized Infinitesimal Gradient Ascent (GIGA) [36], into differentially private algorithms using a "generic" transformation. For both algorithms, we guarantee $(3\epsilon, 2\delta)$-differential privacy with sub-linear regret.

Recall that a differentially private OCP algorithm should not produce a significantly different output for a function sequence $F'_t$ (with high probability) when compared to $F_t$, where $F_t$ and $F'_t$ differ in exactly one function. Hence, to show differential privacy for an OCP algorithm, we first need to show that it is not very "sensitive" to previous cost functions. To this end, we formally define the sensitivity of an OCP algorithm $\mathcal{A}$ below.

Definition 3 ($L_2$-sensitivity [11, 3]). Let $F, F'$ be two function sequences differing in at most one entry, i.e., at most one function can be different. Then, the sensitivity of an algorithm $\mathcal{A} : F \to C^T$ is the largest possible difference in the $t$-th output $\mathbf{x}_{t+1} = \mathcal{A}(F)_t$ of the algorithm $\mathcal{A}$, i.e.,

$$\mathcal{S}(\mathcal{A}, t) = \sup_{F, F'} \|\mathcal{A}(F)_t - \mathcal{A}(F')_t\|_2.$$
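
The following sketch illustrates what a sensitivity that decays with $t$ looks like on a toy algorithm. It assumes costs of the form $f_s(\mathbf{x}) = \|\mathbf{x} - \mathbf{y}_s\|_2^2$ with data points $\mathbf{y}_s$ in the unit ball, and an algorithm that outputs the minimizer of the costs seen so far (the running mean). Both the cost family and the function name are assumptions for illustration; this is not one of the algorithms analyzed in the paper.

```python
import numpy as np

def running_mean_outputs(ys):
    """Row t-1 holds the minimizer of sum_{s<=t} ||x - y_s||^2, i.e. the mean of y_1..y_t."""
    ys = np.asarray(ys, dtype=float)
    return np.cumsum(ys, axis=0) / np.arange(1, len(ys) + 1)[:, None]

rng = np.random.default_rng(1)
T, d = 100, 3
ys = rng.normal(size=(T, d))
ys /= np.maximum(1.0, np.linalg.norm(ys, axis=1, keepdims=True))  # keep data in the unit ball

ys_neighbor = ys.copy()
ys_neighbor[9] = -ys[9]              # neighbouring sequence: only the 10th entry differs

gap = np.linalg.norm(running_mean_outputs(ys) - running_mean_outputs(ys_neighbor), axis=1)
t = np.arange(1, T + 1)
print(np.all(gap <= 2.0 / t + 1e-12))   # change in the t-th output is at most 2 / t
```

Here the constant 2 is the diameter of the unit ball, so for this toy algorithm the change caused by one entry shrinks linearly with the number of entries averaged.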

As mentioned earlier, another natural requirement for an OCP algorithm is that it should have a provably low regret bound. There exist a variety of methods in the literature that satisfy this requirement to different degrees, depending on the class of the functions $f_t$.

Under the above two assumptions on the OCP algorithm A, we provide a general framework for adapting the 
given OCP algorithm (A) into a differentially private algorithm. Formally, the given OCP algorithm A should satisfy 
the following two conditions: 
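Algorithm 1 Private OCP Method (POCP)

Input: OCP algorithm $\mathcal{A}$, cost function sequence $F = \langle f_1, \dots, f_T \rangle$ and the convex set $C$
Parameter: privacy parameters $(\epsilon, \delta)$
Choose $\mathbf{x}_1$ and $\hat{\mathbf{x}}_1$ randomly from $C$
for $t = 1$ to $T - 1$ do
    Cost: $L_t(\hat{\mathbf{x}}_t) = f_t(\hat{\mathbf{x}}_t)$
    OCP Update: $\mathbf{x}_{t+1} \leftarrow \mathcal{A}(\langle f_1, \dots, f_t \rangle, \langle \mathbf{x}_1, \dots, \mathbf{x}_t \rangle, C)$
    Noise Addition: $\tilde{\mathbf{x}}_{t+1} \leftarrow \mathbf{x}_{t+1} + \mathbf{b}_{t+1}$, $\mathbf{b}_{t+1} \sim \mathcal{N}\left(\mathbf{0}^d, \frac{\beta^2}{t^2} \mathbb{I}^d\right)$, where $\beta = \lambda_{\mathcal{A}} T^{0.5+c} \sqrt{\frac{2}{\epsilon}\left(\ln\frac{T}{\delta} + \frac{\sqrt{\epsilon}}{T^{0.5+c}}\right)}$ and $c = \frac{\ln\left(\frac{1}{2}\ln(2/\delta)\right)}{2 \ln T}$
    Output $\hat{\mathbf{x}}_{t+1} = \arg\min_{\mathbf{x} \in C} \|\mathbf{x} - \tilde{\mathbf{x}}_{t+1}\|_2^2$
end for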



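• $L_2$-sensitivity: The $L_2$-sensitivity $\mathcal{S}(\mathcal{A}, t)$ of the algorithm $\mathcal{A}$ should decay linearly with time, i.e.,

$$\mathcal{S}(\mathcal{A}, t) \le \frac{\lambda_{\mathcal{A}}}{t}, \qquad (1)$$

where $\lambda_{\mathcal{A}} > 0$ is a constant depending only on $\mathcal{A}$ and on the strong convexity parameter and Lipschitz constant of the functions in $F$.

• Regret bound $\mathcal{R}_{\mathcal{A}}(T)$: The regret of $\mathcal{A}$ is assumed to be bounded, typically by a sub-linear function of $T$, i.e.,

$$\mathcal{R}_{\mathcal{A}}(T) = \sum_{t=1}^{T} f_t(\mathbf{x}_t) - \min_{\mathbf{x}^* \in C} \sum_{t=1}^{T} f_t(\mathbf{x}^*) = o(T). \qquad (2)$$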

Given $\mathcal{A}$ that satisfies both (1) and (2), we convert it into a private algorithm by perturbing $\mathbf{x}_{t+1}$ (the output of $\mathcal{A}$ at the $t$-th step) by a small amount of noise, whose magnitude depends on the sensitivity parameter $\lambda_{\mathcal{A}}$ of $\mathcal{A}$. Let $\tilde{\mathbf{x}}_{t+1}$ be the perturbed output, which might lie outside the convex set $C$. As our online learning game requires each output to lie in $C$, we project $\tilde{\mathbf{x}}_{t+1}$ back to $C$ and output the projection $\hat{\mathbf{x}}_{t+1}$. Note that our Private OCP (POCP) algorithm also stores the "uncorrupted" iterate $\mathbf{x}_{t+1}$, as it is used in the next step. See Algorithm 1 for pseudo-code of our method.
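
The perturb-then-project loop described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the convex set $C$ is taken to be an $L_2$ ball, the noise scale $\beta$ and the constant $c$ follow our reading of Algorithm 1, and the toy update `running_mean_update` (with each cost represented by a data point) and all function names are hypothetical.

```python
import math
import numpy as np

def pocp_noise_scale(lam, T, eps, delta):
    """Return (beta, c) as read from Algorithm 1; assumes T >= 2 and 0 < delta < 2."""
    c = math.log(0.5 * math.log(2.0 / delta)) / (2.0 * math.log(T))
    eps1 = math.sqrt(eps) / T ** (0.5 + c)            # per-step privacy level
    beta = lam * T ** (0.5 + c) * math.sqrt((2.0 / eps) * (math.log(T / delta) + eps1))
    return beta, c

def project_l2_ball(x, radius=1.0):
    """Euclidean projection onto C = {x : ||x||_2 <= radius} (an assumed choice of C)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def pocp(ocp_update, costs, x1, lam, eps, delta, rng, radius=1.0):
    """Run the private loop for T = len(costs) steps and return the projected outputs."""
    T = len(costs)
    beta, _ = pocp_noise_scale(lam, T, eps, delta)
    xs = [np.asarray(x1, dtype=float)]                # non-private iterates, kept internally
    outputs = [project_l2_ball(xs[0], radius)]        # first output
    for t in range(1, T):
        x_next = ocp_update(costs[:t], xs)            # OCP update from the first t costs
        noise = rng.normal(0.0, beta / t, size=x_next.shape)   # std beta / t per coordinate
        outputs.append(project_l2_ball(x_next + noise, radius))
        xs.append(x_next)                             # store the uncorrupted iterate
    return outputs

def running_mean_update(costs, xs):
    """Toy non-private update: each cost is represented by a data point; play their mean."""
    return np.mean(np.stack(costs), axis=0)

rng = np.random.default_rng(0)
ys = [project_l2_ball(rng.normal(size=3)) for _ in range(50)]
outs = pocp(running_mean_update, ys, np.zeros(3), lam=2.0, eps=1.0, delta=0.01, rng=rng)
print(len(outs), max(np.linalg.norm(o) for o in outs) <= 1.0 + 1e-9)
```

Note that the noisy point is only used to produce the output; the next update is computed from the stored non-private iterates, which is the design choice the paragraph above emphasizes. For small $T$ the noise scale is large relative to the unit ball, so the outputs in this toy run are dominated by noise.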

Now, using the above two assumptions along with concentration bounds for Gaussian noise vectors, we obtain both privacy and regret bounds for our Private OCP algorithm. See Sections 3.1 and 3.2 for a detailed analysis of our privacy guarantee and the regret bound.

In Sections 3.3 and 3.4, we use our abstract private OCP framework to convert the IGD and GIGA algorithms into private OCP methods. For both algorithms, privacy and regret guarantees follow easily from the guarantees of our OCP framework once the corresponding sensitivity bounds are established.



3.1 Privacy Analysis for POCP 

Under the assumption (1), changing one function in the cost function sequence $F$ can lead to a change of at most $\lambda_{\mathcal{A}}/t$ in the $t$-th output of $\mathcal{A}$. Hence, intuitively, adding noise of the same order should make the $t$-th step output of Algorithm 1 differentially private. We make the claim precise in the following lemma.

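Lemma 1. Let $\mathcal{A}$ be an OCP algorithm that satisfies the sensitivity assumption (1). Also, let $c > 0$ be any constant and $\beta = \lambda_{\mathcal{A}} T^{0.5+c} \sqrt{\frac{2}{\epsilon}\left(\ln\frac{T}{\delta} + \frac{\sqrt{\epsilon}}{T^{0.5+c}}\right)}$. Then, the $t$-th step output of Algorithm 1, $\hat{\mathbf{x}}_{t+1}$, is $\left(\frac{\sqrt{\epsilon}}{T^{0.5+c}}, \frac{\delta}{T}\right)$-differentially private.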

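Proof. As the output $\hat{\mathbf{x}}_{t+1}$ is just a projection, i.e., a function (independent of the input functions $F$) of $\tilde{\mathbf{x}}_{t+1}$, $(\epsilon, \delta)$-differential privacy for $\tilde{\mathbf{x}}_{t+1}$ implies the same for $\hat{\mathbf{x}}_{t+1}$.

Now, by the definition of differential privacy (see Definition 2), $\tilde{\mathbf{x}}_{t+1}$ is $\left(\epsilon_1, \frac{\delta}{T}\right)$-differentially private if for any measurable set $\Omega \subseteq \mathbb{R}^d$:

$$\Pr[\tilde{\mathbf{x}}_{t+1} \in \Omega] \le e^{\epsilon_1} \Pr[\tilde{\mathbf{x}}'_{t+1} \in \Omega] + \frac{\delta}{T},$$

where $\tilde{\mathbf{x}}_{t+1} = \mathbf{x}_{t+1} + \mathbf{b}$ is the output of the noise addition step (see Algorithm 1, Step 7) of our POCP algorithm when applied to the function sequence $F_t = \langle f_1, \dots, f_t \rangle$. Similarly, $\tilde{\mathbf{x}}'_{t+1} = \mathbf{x}'_{t+1} + \mathbf{b}$ is the output of the noise addition to $\mathbf{x}'_{t+1}$, which is obtained by applying the update step to $F'_t$, where $F'_t$ differs from $F_t$ in exactly one function entry.

Now, $\tilde{\mathbf{x}}_{t+1} \sim \mathcal{N}\left(\mathbf{x}_{t+1}, \frac{\beta^2}{t^2}\mathbb{I}^d\right)$ and $\tilde{\mathbf{x}}'_{t+1} \sim \mathcal{N}\left(\mathbf{x}'_{t+1}, \frac{\beta^2}{t^2}\mathbb{I}^d\right)$. Let $\Delta\mathbf{x}_{t+1} = \mathbf{x}_{t+1} - \mathbf{x}'_{t+1}$. Then, we have $(\tilde{\mathbf{x}}_{t+1} - \mathbf{x}_{t+1})^T \Delta\mathbf{x}_{t+1} \sim \mathcal{N}\left(0, \frac{\beta^2}{t^2}\|\Delta\mathbf{x}_{t+1}\|_2^2\right)$. Now, using assumption (1) for the OCP algorithm $\mathcal{A}$ and Mill's inequality,

$$\Pr\left[\left|(\tilde{\mathbf{x}}_{t+1} - \mathbf{x}_{t+1})^T \Delta\mathbf{x}_{t+1}\right| \ge \frac{\beta\lambda_{\mathcal{A}}}{t^2} z\right] \le \Pr\left[\left|(\tilde{\mathbf{x}}_{t+1} - \mathbf{x}_{t+1})^T \Delta\mathbf{x}_{t+1}\right| \ge \frac{\beta}{t}\|\mathbf{x}_{t+1} - \mathbf{x}'_{t+1}\|_2\, z\right] \le e^{-z^2/2},$$

where $z > 0$. Setting the R.H.S. $\le \frac{\delta}{T}$, we have $z \ge \sqrt{2\ln\frac{T}{\delta}}$.

Now, we define a "good set" $\mathcal{G}$:

$$\mathbf{x} \in \mathcal{G} \text{ iff } \left|(\mathbf{x} - \mathbf{x}_{t+1})^T \Delta\mathbf{x}_{t+1}\right| < \frac{\beta\lambda_{\mathcal{A}}}{t^2} z. \qquad (3)$$

Note that,

$$\Pr[\tilde{\mathbf{x}}_{t+1} \notin \mathcal{G}] = \Pr\left[\left|(\tilde{\mathbf{x}}_{t+1} - \mathbf{x}_{t+1})^T \Delta\mathbf{x}_{t+1}\right| \ge \frac{\beta\lambda_{\mathcal{A}}}{t^2} z\right] \le \frac{\delta}{T}. \qquad (4)$$

We now bound $\Pr[\tilde{\mathbf{x}}_{t+1} \in \Omega]$:

$$\Pr[\tilde{\mathbf{x}}_{t+1} \in \Omega] \le \Pr[\tilde{\mathbf{x}}_{t+1} \in \Omega \cap \mathcal{G}] + \Pr[\tilde{\mathbf{x}}_{t+1} \notin \mathcal{G}] \le \Pr[\tilde{\mathbf{x}}_{t+1} \in \Omega \cap \mathcal{G}] + \frac{\delta}{T}. \qquad (5)$$

As $\tilde{\mathbf{x}}_{t+1} \sim \mathcal{N}\left(\mathbf{x}_{t+1}, \frac{\beta^2}{t^2}\mathbb{I}^d\right)$,

$$\Pr[\tilde{\mathbf{x}}_{t+1} \in \Omega \cap \mathcal{G}] = \int_{\mathbf{x} \in \Omega \cap \mathcal{G}} \frac{1}{\left(2\pi\beta^2/t^2\right)^{d/2}} \exp\left(-\frac{t^2\|\mathbf{x} - \mathbf{x}_{t+1}\|_2^2}{2\beta^2}\right) d\mathbf{x}. \qquad (6)$$

Now, for $\mathbf{x} \in \Omega \cap \mathcal{G}$:

$$\frac{\exp\left(-\frac{t^2\|\mathbf{x} - \mathbf{x}_{t+1}\|_2^2}{2\beta^2}\right)}{\exp\left(-\frac{t^2\|\mathbf{x} - \mathbf{x}'_{t+1}\|_2^2}{2\beta^2}\right)} = \exp\left(\frac{t^2}{2\beta^2}\Delta\mathbf{x}_{t+1}^T\left(2\mathbf{x} - \mathbf{x}_{t+1} - \mathbf{x}'_{t+1}\right)\right)$$

$$= \exp\left(\frac{t^2}{2\beta^2}\left(2\Delta\mathbf{x}_{t+1}^T(\mathbf{x} - \mathbf{x}_{t+1}) + \|\Delta\mathbf{x}_{t+1}\|_2^2\right)\right)$$

$$\le \exp\left(\frac{t^2}{2\beta^2}\left(2\left|\Delta\mathbf{x}_{t+1}^T(\mathbf{x} - \mathbf{x}_{t+1})\right| + \|\Delta\mathbf{x}_{t+1}\|_2^2\right)\right)$$

$$\le \exp\left(\frac{\lambda_{\mathcal{A}}}{\beta}\left(\sqrt{2\ln\frac{T}{\delta}} + \frac{\lambda_{\mathcal{A}}}{2\beta}\right)\right) \le e^{\epsilon_1}, \qquad (7)$$

where $\epsilon_1 = \frac{\sqrt{\epsilon}}{T^{0.5+c}}$ and $\beta$ is as given in the Lemma statement. The second last inequality follows from the definition of $\mathcal{G}$ (with $z = \sqrt{2\ln\frac{T}{\delta}}$) and the sensitivity assumption (1).

Hence, using (5), (6), and (7), we get:

$$\Pr[\tilde{\mathbf{x}}_{t+1} \in \Omega] \le \int_{\mathbf{x} \in \Omega \cap \mathcal{G}} \frac{e^{\epsilon_1}}{\left(2\pi\beta^2/t^2\right)^{d/2}} \exp\left(-\frac{t^2\|\mathbf{x} - \mathbf{x}'_{t+1}\|_2^2}{2\beta^2}\right) d\mathbf{x} + \frac{\delta}{T} \le e^{\epsilon_1}\Pr[\tilde{\mathbf{x}}'_{t+1} \in \Omega] + \frac{\delta}{T}. \qquad (8)$$

Hence, proved. □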
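
The per-step guarantee of Lemma 1 can be sanity-checked numerically. For two neighbouring iterates at the worst-case distance $\lambda_{\mathcal{A}}/t$, the log-density ratio of the two Gaussians at a noisy sample equals $r g + r^2/2$, where $g$ is a standard normal variable and $r = \lambda_{\mathcal{A}}/\beta$ (the factor $t$ cancels). The sketch below, which takes $\beta$ and $c$ as read from Algorithm 1 and uses a hypothetical function name, computes the exact probability that this ratio exceeds $\epsilon_1$ in absolute value and compares it with $\delta/T$.

```python
import math

def per_step_failure_probability(lam, T, eps, delta):
    """Probability that the one-step privacy loss exceeds eps1 in absolute value.

    Assumes the noise scale beta and the constant c as read from Algorithm 1,
    and neighbouring iterates at the worst-case distance lam / t.
    """
    c = math.log(0.5 * math.log(2.0 / delta)) / (2.0 * math.log(T))
    eps1 = math.sqrt(eps) / T ** (0.5 + c)
    beta = lam * T ** (0.5 + c) * math.sqrt((2.0 / eps) * (math.log(T / delta) + eps1))
    r = lam / beta                                   # distance divided by noise std

    def gaussian_tail(u):
        """Pr[g > u] for a standard normal g."""
        return 0.5 * math.erfc(u / math.sqrt(2.0))

    upper = gaussian_tail(eps1 / r - r / 2.0)        # loss > eps1
    lower = gaussian_tail(eps1 / r + r / 2.0)        # loss < -eps1
    return upper + lower, eps1

T, eps, delta = 1000, 1.0, 1e-3
failure, eps1 = per_step_failure_probability(lam=2.0, T=T, eps=eps, delta=delta)
print(failure <= delta / T)
```

The value of `lam` does not affect the result, since only the ratio $\lambda_{\mathcal{A}}/\beta$ enters the computation.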
Now, the above lemma shows $\left(\frac{\sqrt{\epsilon}}{T^{0.5+c}}, \frac{\delta}{T}\right)$-differential privacy for each step of Algorithm 1. Hence, a simple composition argument (see [10]) guarantees $(T^{0.5-c}\sqrt{\epsilon}, \delta)$-differential privacy for all the steps. So to get an overall privacy parameter that does not grow with $T$, we will need $c = 0.5$. That is, noise of the order $O(T/t)$ needs to be added at each step, which intuitively means that the noise added is larger than the effect of the incoming function $f_t$ and hence can lead to an arbitrarily bad regret.

To avoid this problem, we need to exploit the interdependence between the iterates (and outputs) of our algorithm so as to obtain a better bound than the one obtained by using the union bound. For this purpose, we use the following lemma by [14] that bounds the relative entropy of two random variables in terms of the $L_\infty$ norm of their probability density ratio, and also a proof technique developed by [18, 17] for the problem of releasing differentially private datasets.

Lemma 2 ([14]). Suppose two random variables $Y$ and $Z$ satisfy

$$D_\infty(Y\|Z) = \max_{w \in \mathrm{supp}(Y)} \ln\left(\frac{\mathrm{pdf}[Y = w]}{\mathrm{pdf}[Z = w]}\right) \le \epsilon, \qquad D_\infty(Z\|Y) \le \epsilon.$$

Then $D(Y\|Z) = \int_{w \in \mathrm{supp}(Y)} \mathrm{pdf}[Y = w] \ln\left(\frac{\mathrm{pdf}[Y = w]}{\mathrm{pdf}[Z = w]}\right) dw \le 2\epsilon^2$. Here $\mathrm{supp}(Y)$ is the support set of a random variable $Y$.
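
A quick numerical check of the relationship between the maximum log-ratio and the relative entropy, on a few hand-picked discrete distributions, is sketched below. The distributions and helper names are illustrative assumptions; the check is only meaningful in the regime $\epsilon \le 1$, which is the regime relevant here.

```python
import math

def max_log_ratio(p, q):
    """D_infinity(p || q) for discrete distributions with a common support."""
    return max(math.log(pi / qi) for pi, qi in zip(p, q))

def relative_entropy(p, q):
    """D(p || q) for discrete distributions with a common support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

pairs = [([0.6, 0.4], [0.5, 0.5]),
         ([0.2, 0.3, 0.5], [0.25, 0.25, 0.5]),
         ([0.7, 0.3], [0.4, 0.6])]

results = []
for p, q in pairs:
    eps = max(max_log_ratio(p, q), max_log_ratio(q, p))   # two-sided bound on the log-ratio
    results.append((eps, relative_entropy(p, q), 2.0 * eps ** 2))

for eps, kl, bound in results:
    print(f"eps={eps:.3f}  D(Y||Z)={kl:.4f}  2*eps^2={bound:.4f}")
```

In each case the relative entropy is far smaller than the worst-case log-ratio, which is what allows the expected privacy loss to be summed over $T$ steps without the total growing linearly in the per-step level.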
We now state a technical lemma which will be useful for our differential privacy proof. 
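Lemma 3. Assuming that at each stage $t$, Algorithm 1 preserves $\frac{\sqrt{\epsilon}}{T^{0.5+c}}$-differential privacy,

$$\mathbb{E}_{\tilde{\mathbf{x}}_{t+1}}\left[\ln\left(\frac{\mathrm{pdf}[\tilde{\mathbf{x}}_{t+1}]}{\mathrm{pdf}[\tilde{\mathbf{x}}'_{t+1} = \tilde{\mathbf{x}}_{t+1}]}\right)\right] \le \frac{2\epsilon}{T^{1+2c}},$$

where $\tilde{\mathbf{x}}_{t+1}$ and $\tilde{\mathbf{x}}'_{t+1}$ are the outputs of the $t$-th iteration of the Noise Addition Step of our POCP algorithm (Algorithm 1) when applied to function sequences $F_t$ and $F'_t$ differing in exactly one function entry.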

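Proof. Using the fact that $\tilde{\mathbf{x}}_{t+1}$ is $\frac{\sqrt{\epsilon}}{T^{0.5+c}}$-differentially private:

$$\forall \mathbf{x}, \quad -\frac{\sqrt{\epsilon}}{T^{0.5+c}} \le \ln\left(\frac{\mathrm{pdf}[\tilde{\mathbf{x}}_{t+1} = \mathbf{x}]}{\mathrm{pdf}[\tilde{\mathbf{x}}'_{t+1} = \mathbf{x}]}\right) \le \frac{\sqrt{\epsilon}}{T^{0.5+c}}.$$

The lemma now follows using the above observation with Lemma 2. □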

Now we state the privacy guarantee for Algorithm 1 over all $T$ iterations.

Theorem 1 (POCP Privacy). Let $\mathcal{A}$ be an OCP algorithm that satisfies the sensitivity assumption (1). Then the POCP algorithm (see Algorithm 1) is $(3\epsilon, 2\delta)$-differentially private.

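Proof. Following the notation from the proof of Lemma 1, let $\mathcal{G}$ be defined by (3). Now, using (4), for each round,

$$\Pr[\tilde{\mathbf{x}}_{t+1} \notin \mathcal{G}] \le \frac{\delta}{T}. \qquad (9)$$

Now, the probability that the noise vectors $\mathbf{b}_{t+1} = \tilde{\mathbf{x}}_{t+1} - \mathbf{x}_{t+1} = \tilde{\mathbf{x}}'_{t+1} - \mathbf{x}'_{t+1}$, $1 \le t \le T - 1$, are all from the "good" set $\mathcal{G}$ in all the $T$ rounds is at least $1 - T \cdot \frac{\delta}{T} = 1 - \delta$.

We now condition the remaining proof on the event that the noise vector $\mathbf{b}_{t+1}$ in each round is such that $\tilde{\mathbf{x}}_{t+1} \in \mathcal{G}$.

Let $L(\tilde{\mathbf{x}}_1, \dots, \tilde{\mathbf{x}}_T) = \sum_{t=1}^{T} \ln\left(\frac{\mathrm{pdf}[\tilde{\mathbf{x}}_t]}{\mathrm{pdf}[\tilde{\mathbf{x}}'_t = \tilde{\mathbf{x}}_t]}\right)$. Using Lemma 3,

$$\mathbb{E}_{\tilde{\mathbf{x}}_1, \dots, \tilde{\mathbf{x}}_T}\left[L(\tilde{\mathbf{x}}_1, \dots, \tilde{\mathbf{x}}_T)\right] = \sum_{t=1}^{T} \mathbb{E}_{\tilde{\mathbf{x}}_t}\left[\ln\left(\frac{\mathrm{pdf}[\tilde{\mathbf{x}}_t]}{\mathrm{pdf}[\tilde{\mathbf{x}}'_t = \tilde{\mathbf{x}}_t]}\right)\right] \le \frac{2T\epsilon}{T^{1+2c}} \le 2\epsilon.$$

Let $Z_t = \ln\left(\frac{\mathrm{pdf}[\tilde{\mathbf{x}}_t]}{\mathrm{pdf}[\tilde{\mathbf{x}}'_t = \tilde{\mathbf{x}}_t]}\right)$. Since each $\mathbf{b}_t$ is sampled independently and the randomness in $Z_t$ is only due to $\mathbf{b}_t$, the $Z_t$'s are independent. We have $L(\tilde{\mathbf{x}}_1, \dots, \tilde{\mathbf{x}}_T) = \sum_{t=1}^{T} Z_t$, where $|Z_t| \le \frac{\sqrt{\epsilon}}{T^{0.5+c}}$. By the Azuma-Hoeffding inequality,

$$\Pr\left[L(\tilde{\mathbf{x}}_1, \dots, \tilde{\mathbf{x}}_T) \ge 2\epsilon + \epsilon\right] \le 2\exp\left(-2T^{2c}\right).$$

Setting $\delta = 2\exp\left(-2T^{2c}\right)$, we get $c = \frac{\ln\left(\frac{1}{2}\ln\frac{2}{\delta}\right)}{2\ln T}$. Hence, with probability at least $1 - \delta$, $3\epsilon$-differential privacy holds conditioned on $\tilde{\mathbf{x}}_t \in \mathcal{G}$, i.e.,

$$\forall \mathbf{z}_1, \dots, \mathbf{z}_T \in \mathbb{R}^d, \quad \prod_{t=1}^{T} \mathrm{pdf}(\tilde{\mathbf{x}}_t = \mathbf{z}_t) \le e^{3\epsilon} \prod_{t=1}^{T} \mathrm{pdf}(\tilde{\mathbf{x}}'_t = \mathbf{z}_t).$$

Also, recall that with probability at least $1 - \delta$, the noise vector $\mathbf{b}_t$ in each round itself was such that $\tilde{\mathbf{x}}_t \in \mathcal{G}$. Hence, with probability at least $1 - 2\delta$, $3\epsilon$-differential privacy holds. $(3\epsilon, 2\delta)$-differential privacy now follows using a standard argument similar to [5]. □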
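
The gap between simple composition and the expected-loss argument used in the proof can be made concrete with a few lines of arithmetic. The sketch below, with a hypothetical function name and with $c$ chosen as in the proof, computes the per-step level $\epsilon_1 = \sqrt{\epsilon}/T^{0.5+c}$, the total $T\epsilon_1$ that simple composition would give, and the bound $2T\epsilon_1^2 = 2\epsilon/T^{2c}$ on the expected total privacy loss obtained by summing Lemma 3 over $T$ steps.

```python
import math

def pocp_privacy_accounting(T, eps, delta):
    """Return (c, eps1, naive_total, expected_loss_bound) for the given parameters.

    Assumes c is set as in the proof of Theorem 1, so that T^(2c) = 0.5 * ln(2 / delta).
    """
    c = math.log(0.5 * math.log(2.0 / delta)) / (2.0 * math.log(T))
    eps1 = math.sqrt(eps) / T ** (0.5 + c)        # per-step privacy level
    naive_total = T * eps1                        # simple composition: T^(0.5 - c) * sqrt(eps)
    expected_loss_bound = 2.0 * T * eps1 ** 2     # Lemma 3 summed over T steps
    return c, eps1, naive_total, expected_loss_bound

T, eps, delta = 10_000, 1.0, 1e-3
c, eps1, naive_total, expected_loss_bound = pocp_privacy_accounting(T, eps, delta)
print(f"c={c:.4f}  eps1={eps1:.5f}  naive={naive_total:.2f}  expected<={expected_loss_bound:.3f}")
```

For these example values the simple composition total is far above $\epsilon$, while the bound on the expected loss stays below $2\epsilon$; this is the slack that the concentration argument in the proof exploits.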

3.2 Utility (Regret) Analysis for POCP 

In this section, we provide a generic regret bound analysis for our POCP algorithm (see Algorithm 1). The regret bound of POCP depends on the regret $\mathcal{R}_{\mathcal{A}}(T)$ of the non-private OCP algorithm $\mathcal{A}$. For typical OCP algorithms like IGD, GIGA and FTL, $\mathcal{R}_{\mathcal{A}}(T) = O(\log T)$, assuming each cost function $f_t$ is strongly convex.

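Theorem 2 (POCP Regret). Let $L > 0$ be the maximum Lipschitz constant of any function $f_t$ in the sequence $F$, $\mathcal{R}_{\mathcal{A}}(T)$ the regret of the non-private OCP algorithm $\mathcal{A}$ over $T$ time steps, and $\lambda_{\mathcal{A}}$ the sensitivity parameter of $\mathcal{A}$ (see (1)). Then the expected regret of our POCP algorithm (Algorithm 1) satisfies:

$$\mathbb{E}\left[\sum_{t=1}^{T} f_t(\hat{\mathbf{x}}_t)\right] - \min_{\mathbf{x} \in C} \sum_{t=1}^{T} f_t(\mathbf{x}) \le 2\sqrt{d}\, L\left(\lambda_{\mathcal{A}} + \|C\|_2\right)\sqrt{T}\, \frac{\ln^2\frac{T}{\delta}}{\sqrt{\epsilon}} + \mathcal{R}_{\mathcal{A}}(T),$$

where $d$ is the dimensionality of the output space, and $\|C\|_2$ is the diameter of the convex set $C$. In other words, the regret bound is $\mathcal{R}_{\mathcal{A}}(T) + \tilde{O}(\sqrt{dT})$.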



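Proof. Let $\hat{x}_1, \dots, \hat{x}_T$ be the output of the POCP algorithm. By the Lipschitz continuity of the cost functions $f_t$ we have,

$$\sum_{t=1}^{T} f_t(\hat{x}_t) - \min_{x \in \mathcal{C}} \sum_{t=1}^{T} f_t(x) \le \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{C}} \sum_{t=1}^{T} f_t(x) + L\sum_{t=1}^{T}\|\hat{x}_t - x_t\|_2 \le \mathcal{R}_{\mathcal{A}}(T) + L\sum_{t=1}^{T}\|\hat{x}_t - x_t\|_2. \quad (10)$$

Since at any time $t \ge 1$, $\hat{x}_t$ is the projection of $\tilde{x}_t$ on the convex set $\mathcal{C}$, we have

$$\|x_{t+1} - \hat{x}_{t+1}\|_2 \le \|x_{t+1} - \tilde{x}_{t+1}\|_2 = \|b_{t+1}\|_2, \quad \forall\, 1 \le t \le T-1,$$

where $b_{t+1}$ is the noise vector added in the $t$-th iteration of the POCP algorithm. Therefore,

$$L\sum_{t=1}^{T}\|\hat{x}_t - x_t\|_2 \le L\left(\|\mathcal{C}\|_2 + \sum_{t=1}^{T-1}\|b_{t+1}\|_2\right). \quad (11)$$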

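Now, $b_{t+1} \sim \mathcal{N}\left(0^d, \frac{\beta^2}{t^2} I^d\right)$ where

$$\beta = \lambda_{\mathcal{A}}\, T^{0.5+c} \sqrt{\frac{2}{\epsilon}\left(\ln\frac{T}{\delta} + \frac{\sqrt{\epsilon}}{T^{0.5+c}}\right)}.$$

Therefore, $\|b_{t+1}\|_2$ follows a Chi-distribution with parameters $\mu = \sqrt{2}\,\frac{\beta}{t}\,\frac{\Gamma((d+1)/2)}{\Gamma(d/2)}$ and $\sigma^2 = \frac{\beta^2}{t^2}\,d - \mu^2$.

Using $c = \frac{\ln\left(\frac{1}{2}\ln\frac{2}{\delta}\right)}{2\ln T}$,

$$\mathbb{E}\left[\sum_{t=1}^{T-1}\|b_{t+1}\|_2\right] \le \sqrt{2}\,\beta\,\frac{\Gamma((d+1)/2)}{\Gamma(d/2)} \int_{1}^{T} \frac{1}{t}\,dt \le 2\sqrt{d}\,\lambda_{\mathcal{A}}\sqrt{T}\,\frac{\ln^2\frac{T}{\delta}}{\sqrt{\epsilon}}. \quad (12)$$

The theorem now follows by combining (10), (11), (12). □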
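As an illustrative aside (a sketch under stated assumptions, not the authors' code), the mean of the Chi-distributed noise norm used in the proof can be evaluated directly. The helper names `expected_noise_norm` and `summed_noise_bound` are hypothetical. The sketch uses the standard fact that for $b \sim \mathcal{N}(0, s^2 I_d)$ the mean of $\|b\|_2$ is $s\sqrt{2}\,\Gamma((d+1)/2)/\Gamma(d/2)$, which never exceeds $s\sqrt{d}$.

```python
import math

def expected_noise_norm(d: int, scale: float) -> float:
    """Mean of ||b||_2 for b ~ N(0, scale^2 I_d), i.e. a scaled Chi mean."""
    # lgamma keeps the Gamma ratio stable for large d.
    log_ratio = math.lgamma((d + 1) / 2.0) - math.lgamma(d / 2.0)
    return scale * math.sqrt(2.0) * math.exp(log_ratio)

def summed_noise_bound(d: int, beta: float, T: int) -> float:
    """Sum over t = 1..T-1 of E||b_{t+1}||_2 when round t uses scale beta / t."""
    return sum(expected_noise_norm(d, beta / t) for t in range(1, T))

if __name__ == "__main__":
    d, beta, T = 5, 2.0, 100  # illustrative values only
    total = summed_noise_bound(d, beta, T)
    print(total, "<=", math.sqrt(d) * beta * (1.0 + math.log(T)))
```

Because the per-round scale decays as $1/t$, the summed expectation grows only logarithmically in $T$ for a fixed $\beta$; the $\sqrt{T}$ factor in the regret bound enters through $\beta$ itself.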
Using Chebyshev's inequality, we can also obtain a high probability bound on the regret. 

Corollary 1. Let $L > 0$ be the maximum Lipschitz constant of any function $f_t$ in the sequence $F$, $\mathcal{R}_{\mathcal{A}}(T)$ the regret of the non-private OCP algorithm $\mathcal{A}$ over $T$ time steps, and $\lambda_{\mathcal{A}}$ the sensitivity parameter of $\mathcal{A}$ (see the sensitivity assumption). Then with
probability at least 1 — j,the regret of our Private OCP algorithm (Algorithm^ satisfies: 

T T _, 2 T 

ft(x t ) - mm ft(x) < 2VdL(X A + ||C|| 2 )VT^=§- + K A (T), 
t=i x£ t=i 

where d is the dimensionality of the output space, \\C\\2 is the diameter ofC. 



3.3 Implicit Gradient Descent Algorithm 

In this section, we consider the Implicit Gradient Descent (IGD) algorithm [27], a popular online convex programming algorithm, and present a differentially private version of the same using our generic framework (see Algorithm 1). Before deriving its privacy preserving version, we first briefly describe the IGD algorithm [27].

At each step $t$, IGD incurs loss $f_t(x_t)$. Now, given $f_t$, IGD finds the $t$-th step output $x_{t+1}$ so that it is not "far" away from the current solution $x_t$ but at the same time tries to minimize the cost $f_t(x_{t+1})$. Formally,

$$\text{IGD:} \quad x_{t+1} \leftarrow \arg\min_{x \in \mathcal{C}} \frac{1}{2}\|x - x_t\|_2^2 + \eta_t f_t(x), \quad (13)$$

where squared Euclidean distance is used as the notion of distance from the current iterate. [27] describe a much larger class of distance functions that can be used, but for simplicity of exposition we consider the Euclidean distance only. Assuming each $f_t(x)$ is a strongly convex function, a simple modification of the proof by [27] shows $O(\log T)$ regret for IGD, i.e. $\mathcal{R}_{IGD}(T) = O(\log T)$.

Recall that our generic private OCP framework can be used to convert any OCP algorithm as long as it satisfies the low-sensitivity and low-regret assumptions. Now, similar to POCP, our Private IGD (PIGD) algorithm also adds appropriately calibrated noise at each update step to obtain differentially private outputs $\hat{x}_{t+1}$. See Algorithm 2 for pseudo-code of our algorithm.

As stated above, $\mathcal{R}_{IGD}(T) = O(\log T)$ if each $f_t(x)$ is strongly convex. We now bound the sensitivity of IGD at each step in the following lemma. The proof makes use of a simple and novel induction based technique.

Lemma 4 (IGD Sensitivity). The $L_2$-sensitivity (see Definition 3) of the IGD algorithm is $\frac{2L}{t}$ for the $t$-th iterate, where $L$ is the maximum Lipschitz constant of any function $f_\tau$, $1 \le \tau \le t$.

Proof. We prove the above lemma using mathematical induction.

Base Case ($t = 1$): As $x_1$ is selected randomly, its value doesn't depend on the underlying dataset.

Induction Step ($t = \tau + 1$): As $f_\tau$ is strongly convex, the strong convexity coefficient of the function $\tilde{f}_\tau(x) =$
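Algorithm 2 Private Implicit Gradient Descent (PIGD)
1: Input: Cost function sequence $F = (f_1, \dots, f_T)$ and the convex set $\mathcal{C}$
2: Parameter: privacy parameters $(\epsilon, \delta)$, maximum Lipschitz constant $L$ and minimum strong convexity parameter $\alpha$ of any function in $F$
3: Choose $x_1$ and $\hat{x}_1$ randomly from $\mathcal{C}$
4: for $t = 1$ to $T - 1$ do
5: Cost: $L_t(\hat{x}_t) = f_t(\hat{x}_t)$
6: Learning rate: $\eta_t = \frac{1}{\alpha t}$
7: IGD Update: $x_{t+1} \leftarrow \arg\min_{x \in \mathcal{C}} \left(\frac{1}{2}\|x - x_t\|_2^2 + \eta_t f_t(x)\right)$
8: Noise Addition: $\tilde{x}_{t+1} \leftarrow x_{t+1} + b_{t+1}$, $b_{t+1} \sim \mathcal{N}\left(0^d, \frac{\beta^2}{t^2} I^d\right)$, where $\beta = 2L\,T^{0.5+c}\sqrt{\frac{2}{\epsilon}\left(\ln\frac{T}{\delta} + \frac{\sqrt{\epsilon}}{T^{0.5+c}}\right)}$ and $c = \frac{\ln\left(\frac{1}{2}\ln(2/\delta)\right)}{2\ln T}$
9: Output $\hat{x}_{t+1} = \arg\min_{x \in \mathcal{C}} \|x - \tilde{x}_{t+1}\|_2^2$
10: end for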



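$\frac{1}{2}\|x - x_\tau\|_2^2 + \eta_\tau f_\tau(x)$ is $\frac{\tau+1}{\tau}$. Now using strong convexity and the fact that at the optimum $x_{\tau+1}$, $\langle \nabla \tilde{f}_\tau(x_{\tau+1}), x - x_{\tau+1} \rangle \ge 0$, $\forall x \in \mathcal{C}$, we get:

$$\tilde{f}_\tau(x'_{\tau+1}) \ge \tilde{f}_\tau(x_{\tau+1}) + \frac{\tau+1}{2\tau}\|x_{\tau+1} - x'_{\tau+1}\|_2^2. \quad (14)$$

Now, we consider two cases:

• $F - F' = \{f_\tau\}$: Define $\tilde{f}'_\tau(x) = \frac{1}{2}\|x - x_\tau\|_2^2 + \eta_\tau f'_\tau(x)$ and let $x'_{\tau+1} = \arg\min_{x \in \mathcal{C}} \tilde{f}'_\tau(x)$. Then, similar to (14), we get:

$$\tilde{f}'_\tau(x_{\tau+1}) \ge \tilde{f}'_\tau(x'_{\tau+1}) + \frac{\tau+1}{2\tau}\|x_{\tau+1} - x'_{\tau+1}\|_2^2. \quad (15)$$

Adding (14) and (15), we get:

$$\|x_{\tau+1} - x'_{\tau+1}\|_2^2 \le \frac{1}{\tau+1}\left|f_\tau(x'_{\tau+1}) + f'_\tau(x_{\tau+1}) - f_\tau(x_{\tau+1}) - f'_\tau(x'_{\tau+1})\right| \le \frac{2L}{\tau+1}\|x_{\tau+1} - x'_{\tau+1}\|_2.$$

The lemma now follows by simplification.

• $F - F' = \{f_i\}$, $i < \tau$: Define $\tilde{f}'_\tau(x) = \frac{1}{2}\|x - x'_\tau\|_2^2 + \eta_\tau f_\tau(x)$ and let $x'_{\tau+1} = \arg\min_{x \in \mathcal{C}} \tilde{f}'_\tau(x)$. Then, similar to (14), we get:

$$\tilde{f}'_\tau(x_{\tau+1}) \ge \tilde{f}'_\tau(x'_{\tau+1}) + \frac{\tau+1}{2\tau}\|x_{\tau+1} - x'_{\tau+1}\|_2^2. \quad (16)$$

Adding (14) and (16), we get:

$$\|x_{\tau+1} - x'_{\tau+1}\|_2^2 \le \frac{\tau}{\tau+1}\left|(x_{\tau+1} - x'_{\tau+1}) \cdot (x_\tau - x'_\tau)\right| \le \frac{\tau}{\tau+1}\|x_{\tau+1} - x'_{\tau+1}\|_2\,\|x_\tau - x'_\tau\|_2.$$

The lemma now follows after simplification and using the induction hypothesis.

□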
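The sensitivity bound of Lemma 4 can be probed empirically on a toy problem. The sketch below is illustrative only and rests on assumptions that are not in the text: a one-dimensional domain $\mathcal{C} = [-1, 1]$, costs $f_t(x) = \frac{\alpha}{2}(x - a_t)^2$ with $a_t \in \mathcal{C}$ and $\alpha = 1$ (so $L = 2$), and the hypothetical helper names `igd_1d` and `max_sensitivity_ratio`. For these costs the IGD step has a closed form.

```python
import random

def igd_1d(targets, alpha=1.0, lo=-1.0, hi=1.0, x1=0.0):
    """IGD on f_t(x) = (alpha/2)(x - a_t)^2 over [lo, hi] with eta_t = 1/(alpha t)."""
    xs = [x1]
    for t, a in enumerate(targets, start=1):
        eta = 1.0 / (alpha * t)
        # Closed-form minimiser of 0.5 (x - x_t)^2 + eta * (alpha/2)(x - a)^2.
        x_next = (xs[-1] + eta * alpha * a) / (1.0 + eta * alpha)
        # In one dimension the constrained minimiser is the clipped one.
        xs.append(min(hi, max(lo, x_next)))
    return xs

def max_sensitivity_ratio(T=200, trials=50, seed=0):
    """Largest observed |x_{t+1} - x'_{t+1}| / (2L/t) over neighbouring inputs."""
    rng = random.Random(seed)
    L = 2.0  # |f_t'(x)| = |x - a_t| <= 2 on [-1, 1] when alpha = 1
    worst = 0.0
    for _ in range(trials):
        a = [rng.uniform(-1.0, 1.0) for _ in range(T)]
        b = list(a)
        b[rng.randrange(T)] = rng.uniform(-1.0, 1.0)  # change one cost function
        xs, ys = igd_1d(a), igd_1d(b)
        for t in range(1, T + 1):
            worst = max(worst, abs(xs[t] - ys[t]) / (2.0 * L / t))
    return worst

if __name__ == "__main__":
    print("worst ratio:", max_sensitivity_ratio())
```

A ratio of at most 1 means that no sampled pair of neighbouring sequences exceeded the $2L/t$ bound; this is a sanity check on one family of costs, not a proof.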

Using the above lemma and Theorem 1, the privacy guarantee for PIGD follows directly.

Theorem 3 (PIGD Privacy). PIGD (see Algorithm 2) is $(3\epsilon, 2\delta)$-differentially private.
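To make the steps of PIGD concrete, the following is a minimal one-dimensional sketch, not the authors' implementation. It assumes costs $f_t(x) = \frac{\alpha}{2}(x - a_t)^2$ on $\mathcal{C} = [\text{lo}, \text{hi}]$, for which the IGD step has a closed form; the function name `pigd_1d` and its parameters are hypothetical, and the constants in `beta` and `c` follow one reading of Algorithm 2 and should be treated as an assumption.

```python
import math
import random

def pigd_1d(targets, eps, delta, alpha=1.0, lo=-1.0, hi=1.0, rng=None):
    """Sketch of PIGD for f_t(x) = (alpha/2)(x - a_t)^2 on the interval [lo, hi]."""
    rng = rng or random.Random(0)
    T = len(targets) + 1
    L = alpha * (hi - lo)  # Lipschitz bound on [lo, hi] when a_t lies in [lo, hi]
    c = math.log(0.5 * math.log(2.0 / delta)) / (2.0 * math.log(T))
    # Noise scale as read from Algorithm 2 (assumed constants).
    beta = 2.0 * L * T ** (0.5 + c) * math.sqrt(
        (2.0 / eps) * (math.log(T / delta) + math.sqrt(eps) / T ** (0.5 + c))
    )
    x = rng.uniform(lo, hi)   # non-private iterate x_1
    outputs = [x]             # released iterates, starting with x_hat_1
    for t, a in enumerate(targets, start=1):
        eta = 1.0 / (alpha * t)
        # IGD update (closed form for this quadratic cost), kept inside C.
        x = min(hi, max(lo, (x + eta * alpha * a) / (1.0 + eta * alpha)))
        # Noise addition with standard deviation beta / t.
        x_tilde = x + rng.gauss(0.0, beta / t)
        # Released output: projection of the noisy iterate onto C.
        outputs.append(min(hi, max(lo, x_tilde)))
    return outputs

if __name__ == "__main__":
    outs = pigd_1d([0.3] * 49, eps=1.0, delta=0.01)
    print(outs[-5:])
```

Note that the noise is added to a copy of the iterate: the non-private sequence $x_t$ continues to evolve without noise, and only the projected noisy value is released.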

Next, the utility (regret) analysis of our PIGD algorithm follows directly using Theorem 2 along with the regret bound of the IGD algorithm, $\mathcal{R}_{IGD}(T) = O\left(\frac{L^2}{\alpha}\log T + \|\mathcal{C}\|_2\right)$. The regret bound provided below scales roughly as $\tilde{O}(\sqrt{T})$.
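Algorithm 3 Private GIGA (PGIGA)
1: Input: Cost function sequence $F = (f_1, \dots, f_T)$ and the convex set $\mathcal{C}$
2: Parameter: Privacy parameters $(\epsilon, \delta)$, Lipschitz continuity ($L$) and strong convexity ($\alpha$) bound on the function sequence $F$, $t_q = 2L_G^2/\alpha^2$
3: Choose $x_1, \dots, x_{t_q - 1}$ and $\hat{x}_1, \dots, \hat{x}_{t_q - 1}$ randomly from $\mathcal{C}$, incurring a cost of $\sum_{t=1}^{t_q - 1} f_t(\hat{x}_t)$
4: for $t = t_q$ to $T - 1$ do
5: Cost: $L_t(\hat{x}_t) = f_t(\hat{x}_t)$
6: Step Size: $\eta_t = \frac{2}{\alpha t}$
7: GIGA Update: $x_{t+1} \leftarrow \arg\min_{x \in \mathcal{C}} \|x_t - \eta_t \nabla f_t(x_t) - x\|_2^2$
8: Noise Addition: $\tilde{x}_{t+1} \leftarrow x_{t+1} + b_{t+1}$, $b_{t+1} \sim \mathcal{N}\left(0^d, \frac{\beta^2}{t^2} I^d\right)$, where $\beta = 2G\,T^{0.5+c}\sqrt{\frac{2}{\epsilon}\left(\ln\frac{T}{\delta} + \frac{\sqrt{\epsilon}}{T^{0.5+c}}\right)}$ and $c = \frac{\ln\left(\frac{1}{2}\ln(2/\delta)\right)}{2\ln T}$
9: Output $\hat{x}_{t+1} = \arg\min_{x \in \mathcal{C}} \|x - \tilde{x}_{t+1}\|_2^2$
10: end for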



Theorem 4 (PIGD Regret). Let $L$ be the maximum Lipschitz constant and let $\alpha$ be the minimum strong convexity parameter of any function $f_t$ in the function sequence $F$. Then the expected regret of the private IGD algorithm over $T$ time steps is $\tilde{O}(\sqrt{T})$. Specifically,

mV^rvi ■ X-n * < r ( (^V^j^g^j 
EIL/tR)]-™),/^)) < c 7= vT , 

U. xec t! V ^ / 

where $C > 0$ is a constant and $d$ is the dimensionality of the output space.

3.4 Private GIGA Algorithm 

In this section, we apply our general differential privacy framework to the Generalized Infinitesimal Gradient Ascent (GIGA) algorithm [36], which is one of the most popular algorithms for OCP. GIGA is a simple extension of the classical projected gradient method to the OCP problem. Specifically, the iterates $x_{t+1}$ are obtained by a projection onto the convex set $\mathcal{C}$ of the output of the gradient descent step $x_t - \eta_t \nabla f_t(x_t)$, where $\eta_t = 1/(\alpha t)$ and $\alpha$ is the minimum strong convexity parameter of any function $f_t$ in $F$.

For the rest of this section, we assume that each function $f_t$ in the input function sequence $F$ is differentiable, has a Lipschitz continuous gradient, and is strongly convex. Note that this is a stricter requirement than for our private IGD algorithm, where we require only the Lipschitz continuity of $f_t$.

Proceeding as in the previous section, we obtain a privacy preserving version of the GIGA algorithm using our generic POCP framework (see Algorithm 1). Algorithm 3 details the steps involved in our Private GIGA (PGIGA) algorithm. Note that PGIGA has an additional step (Step 3) compared to POCP (Algorithm 1). This step is required to prove the sensitivity bound in Lemma 5 given below.

Furthermore, we provide the privacy and regret guarantees for our PGIGA algorithm using Theorem 1 and Theorem 2. To this end, we first show that GIGA satisfies the sensitivity assumption.

Lemma 5 (GIGA Sensitivity). Let $\alpha > 0$ be the minimum strong convexity parameter of any function $f_t$ in the function sequence $F$. Also, let $L_G$ be the maximum Lipschitz continuity parameter of the gradient of any function $f_t \in F$ and let $G = \max_\tau \|\nabla f_\tau(x)\|_2$, $\forall x \in \mathcal{C}$. Then, the $L_2$-sensitivity (see Definition 3) of the GIGA algorithm is $\frac{2G}{t}$ for the $t$-th iterate, where $1 \le t \le T$.

Proof. Let $x_{t+1}$ and $x'_{t+1}$ be the $t$-th iterates when GIGA is applied to $F$ and $F'$, respectively. Using this notation, to prove the $L_2$-sensitivity of GIGA, we need to show that:

$$\|x_{t+1} - x'_{t+1}\|_2 \le \frac{2G}{t}.$$






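We prove the above inequality using mathematical induction.

Base Case ($1 \le t \le t_q = 2L_G^2/\alpha^2 + 1$): As $x_1, \dots, x_{t_q}$ are selected randomly, their value doesn't depend on the underlying dataset. Hence, $x_t = x'_t$, $\forall\, 1 \le t \le t_q$.

Induction Step ($t = \tau > 2L_G^2/\alpha^2 + 1$): We consider two cases:

• $F - F' = \{f_\tau\}$: Since the difference between $F$ and $F'$ is only the $\tau$-th function, $x_\tau = x'_\tau$. As $\mathcal{C}$ is a convex set, projection onto $\mathcal{C}$ always decreases distance, hence:

$$\begin{aligned}
\|x_{\tau+1} - x'_{\tau+1}\|_2 &\le \|(x_\tau - \eta_\tau \nabla f_\tau(x_\tau)) - (x_\tau - \eta_\tau \nabla f'_\tau(x_\tau))\|_2 \\
&= \eta_\tau \|\nabla f_\tau(x_\tau) - \nabla f'_\tau(x_\tau)\|_2 \\
&\le \frac{2G}{\alpha\tau}.
\end{aligned}$$

Hence, the lemma holds in this case.

• $F - F' = \{f_i\}$, $i < \tau$: Again using convexity of $\mathcal{C}$, we get:

$$\begin{aligned}
\|x_{\tau+1} - x'_{\tau+1}\|_2^2 &\le \|(x_\tau - \eta_\tau \nabla f_\tau(x_\tau)) - (x'_\tau - \eta_\tau \nabla f_\tau(x'_\tau))\|_2^2 \\
&= \|x_\tau - x'_\tau\|_2^2 + \eta_\tau^2 \|\nabla f_\tau(x_\tau) - \nabla f_\tau(x'_\tau)\|_2^2 - 2\eta_\tau (x_\tau - x'_\tau)^T(\nabla f_\tau(x_\tau) - \nabla f_\tau(x'_\tau)) \\
&\le (1 + \eta_\tau^2 L_G^2)\|x_\tau - x'_\tau\|_2^2 - 2\eta_\tau (x_\tau - x'_\tau)^T(\nabla f_\tau(x_\tau) - \nabla f_\tau(x'_\tau)), \quad (17)
\end{aligned}$$

where the last inequality follows using Lipschitz continuity of $\nabla f_\tau$. Now, using strong convexity:

$$(x_\tau - x'_\tau)^T(\nabla f_\tau(x_\tau) - \nabla f_\tau(x'_\tau)) \ge \alpha\|x_\tau - x'_\tau\|_2^2.$$

Combining the above observation and the induction hypothesis with (17):

$$\|x_{\tau+1} - x'_{\tau+1}\|_2^2 \le (1 + L_G^2\eta_\tau^2 - 2\alpha\eta_\tau) \cdot \frac{4G^2}{(\tau-1)^2}. \quad (18)$$

The lemma now follows by setting $\eta_\tau = \frac{2}{\alpha\tau}$ and $\tau > \frac{2L_G^2}{\alpha^2}$.

□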
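The last step of the proof is an arithmetic claim that is easy to check numerically. The sketch below is illustrative only; the function names are hypothetical. It tests whether the factor in (18) with $\eta_\tau = 2/(\alpha\tau)$ turns a bound of $4G^2/(\tau-1)^2$ into $4G^2/\tau^2$, which is the induction hypothesis advanced by one step ($G$ cancels on both sides).

```python
def contraction_factor(tau: int, alpha: float, L_G: float) -> float:
    """The factor (1 + L_G^2 eta^2 - 2 alpha eta) from (18) with eta = 2 / (alpha tau)."""
    eta = 2.0 / (alpha * tau)
    return 1.0 + (L_G * eta) ** 2 - 2.0 * alpha * eta

def induction_step_holds(tau: int, alpha: float, L_G: float) -> bool:
    """Check factor / (tau - 1)^2 <= 1 / tau^2, i.e. the squared bound 4G^2 / tau^2."""
    lhs = contraction_factor(tau, alpha, L_G) / (tau - 1) ** 2
    return lhs <= 1.0 / tau ** 2 + 1e-15

if __name__ == "__main__":
    alpha, L_G = 0.5, 2.0                 # illustrative values only
    threshold = 2.0 * L_G ** 2 / alpha ** 2
    first_ok = next(t for t in range(2, 10_000) if induction_step_holds(t, alpha, L_G))
    print("threshold:", threshold, "first tau where the step holds:", first_ok)
```

Expanding the inequality shows why the threshold appears: the condition reduces to $4L_G^2/(\alpha^2\tau) \le 2 + 1/\tau$, which holds whenever $\tau \ge 2L_G^2/\alpha^2$.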

Using the lemma above with the privacy analysis of POCP (Theorem 1), the privacy guarantee for PGIGA follows immediately.

Theorem 5 (PGIGA Privacy). PGIGA (see Algorithm 3) is $(3\epsilon, 2\delta)$-differentially private.

Next, using the regret bound analysis for GIGA from [20] (Theorem 1) along with Theorem 2, we get the following utility (regret bound) analysis for our PGIGA algorithm. Here again, ignoring constants, the regret simplifies to $\tilde{O}(\sqrt{dT})$.

Theorem 6 (PGIGA Regret). Let $\alpha > 0$ be the minimum strong convexity parameter of any function $f_t$ in the function sequence $F$. Also, let $L_G$ be the maximum Lipschitz continuity parameter of the gradient of any function $f_t \in F$ and let $G = \max_\tau \|\nabla f_\tau(x)\|_2$, $\forall x \in \mathcal{C}$. Then, the expected regret of PGIGA satisfies

E[RpG1GA(T)1 < wa^ + w^y ^jg + + 2L|o» 

\Je a cr 

where $\|\mathcal{C}\|_2$ is the diameter of the convex set $\mathcal{C}$ and $d$ is the dimensionality of the output space.

Proof. Observe that for the first $t_q = \frac{2L_G^2}{\alpha^2}$ iterations PGIGA outputs random samples from $\mathcal{C}$. The additional regret incurred during this time is bounded by a constant (w.r.t. $T$) that appears as the last term in the regret bound given above. For iterations $t \ge t_q$, the proof follows directly by using Theorem 2 and the regret bound of GIGA. Note that we use a slightly modified step size $\eta_t = 2/(\alpha t)$, instead of the standard $\eta_t = 1/(\alpha t)$. This difference in the step size increases the regret of GIGA as given by [20] by a factor of 2. □
In Section 3.3 as well as this section, we provided examples of the conversion of two standard online learning algorithms into privacy preserving algorithms with provably bounded regret. In both these examples, we show low sensitivity of the corresponding learning algorithms and use our analysis of POCP to obtain privacy and utility bounds. Similarly, we can obtain privacy preserving variants of many other OCP algorithms such as Follow The Leader (FTL), Follow the Regularized Leader (FTRL) etc. Our low-sensitivity proofs should be of independent interest to the online learning community as well, as they point to a connection between stability (sensitivity) and low regret (online learnability), an open problem in the learning community.

3.5 Logarithmic regret for Quadratic Cost Functions 

In Sections 3.3 and 3.4 we described two differentially private algorithms with $\tilde{O}(\sqrt{T})$ regret for any strongly convex Lipschitz continuous cost functions. In this section we show that by restricting the cost functions to a practically important class of quadratic functions, we can design a differentially private algorithm to achieve logarithmic regret. For simplicity of exposition, we consider cost functions of the form:

$$f_t(x) = \frac{1}{2}(y_t - v_t^T x)^2 + \frac{\alpha}{2}\|x\|_2^2, \quad (19)$$

for some $\alpha > 0$. For such cost functions we show that we can achieve $O(\mathrm{poly}(\log T))$ regret while providing $(\epsilon, \delta)$-differential privacy.

Our algorithm at a high level is a modified version of the Follow the Leader (FTL) algorithm [20]. The FTL algorithm obtains the $t$-th step output as:

$$\text{FTL:} \quad x_{t+1} = \arg\min_{x \in \mathcal{C}} \sum_{\tau=1}^{t} f_\tau(x). \quad (20)$$

For our quadratic cost function (19) with $\mathcal{C} = \mathbb{R}^d$, the above update yields

$$\text{QFTL:} \quad x_{t+1} = (t\alpha I + V_t)^{-1} u_t, \quad (21)$$

where $V_t = V_{t-1} + v_t v_t^T$ and $u_t = u_{t-1} + y_t v_t$ with $V_0 = 0$ and $u_0 = 0$. Using elementary linear algebra and assuming $|y_t| \le R$ and $\|v_t\|_2 \le R$, we can show that $\|x_{t+1}\|_2 \le 2R/\alpha$, $\forall t$. Now, using Theorem 2 of [22] along with our bound on $\|x_t\|_2$, we obtain the following regret bound for the quadratic loss functions based FTL (QFTL) algorithm:

$$\mathcal{R}_{QFTL}(T) \le \frac{R^4(1 + 2R/\alpha)^2}{\alpha}\log T. \quad (22)$$

Furthermore, we can show that the QFTL algorithm (see Equation (21)) also satisfies Assumption 1. Hence, similar to Sections 3.3 and 3.4, we can obtain a differentially private variant of QFTL with $\tilde{O}(\sqrt{T})$ regret. However, we show that using the special structure of QFTL updates (see (21)), we can obtain a differentially private variant of QFTL with just $O(\mathrm{poly}(\log T))$ regret, a significant improvement over $\tilde{O}(\sqrt{T})$ regret.

The key observation behind our method is that each QFTL update is dependent on the function sequence $F$ through $V_t$ and $u_t$ only. Hence, computing $V_t$ and $u_t$ in a differentially private manner would imply differential privacy for our QFTL updates as well. Furthermore, each $V_t$ and $u_t$ themselves are obtained by simply adding an "update" to the output at step $t - 1$. This special structure of $V_t$ and $u_t$ facilitates usage of a generalization of the "tree-based" technique for computing privacy preserving partial sums proposed by [13]. Note that the "tree-based" technique to compute sums (see Algorithm 5) adds a significantly lower amount of noise at each step than that added by our POCP algorithm (see Algorithm 1), leading to significantly better regret. Algorithm 4 provides pseudo-code of our PQFTL method. At each step $t$, $\hat{V}_t$ and $\hat{u}_t$ are computed by perturbing $V_t$ and $u_t$ (to preserve privacy) using the PrivateSum algorithm (see Algorithm 5). Next, $\hat{V}_t$ and $\hat{u}_t$ are used in the QFTL update (see (21)) to obtain the next iterate $\hat{x}_{t+1}$.
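The non-private QFTL update can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: it maintains the running statistics $V_t$ and $u_t$ and solves the linear system $(t\alpha I + V_t)x = u_t$ with a small dense solver. The helper names `solve` and `qftl` are hypothetical, and plain Python lists are used so that the block is self-contained.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def qftl(stream, alpha, d):
    """Yield x_{t+1} = (t alpha I + V_t)^{-1} u_t for a stream of (v_t, y_t) pairs."""
    V = [[0.0] * d for _ in range(d)]   # V_t = sum of v v^T
    u = [0.0] * d                        # u_t = sum of y v
    iterates = []
    for t, (v, y) in enumerate(stream, start=1):
        for i in range(d):
            u[i] += y * v[i]
            for j in range(d):
                V[i][j] += v[i] * v[j]
        A = [[V[i][j] + (t * alpha if i == j else 0.0) for j in range(d)]
             for i in range(d)]
        iterates.append(solve(A, u))
    return iterates

if __name__ == "__main__":
    data = [([1.0, 0.0], 0.5), ([0.5, -1.0], -0.2), ([0.3, 0.7], 1.0)]
    print(qftl(data, alpha=0.1, d=2)[-1])
```

The private variant described in the text would replace `V` and `u` in the solve step by their noisy counterparts; only those two statistics touch the data.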

Now, we provide both privacy and utility (regret bound) guarantees for our PQFTL algorithm. First, we prove the privacy of the PQFTL algorithm (Algorithm 4).
Algorithm 4 Private Follow the Leader for Quadratic Cost (PQFTL)
1: Input: cost function sequence $F = (f_1, \dots, f_T)$, where each $f_t(x; y_t, v_t) = (y_t - v_t^T x)^2 + \frac{\alpha}{2}\|x\|_2^2$
2: Parameter: privacy parameters $(\epsilon, \delta)$, $R = \max(\max_t \|v_t\|_2, \max_t |y_t|)$
3: Initialize $\hat{x}_1 = 0^d$
4: Initialize empty binary trees $B^v$ and $B^u$, a data structure to compute $\hat{V}_t$ and $\hat{u}_t$, the differentially private versions of $V_t$ and $u_t$
5: for $t = 1$ to $T - 1$ do
6: Cost: $L_t(\hat{x}_t) = f_t(\hat{x}_t) = (y_t - v_t^T \hat{x}_t)^2 + \frac{\alpha}{2}\|\hat{x}_t\|_2^2$
7: $(\hat{V}_t, B^v) \leftarrow$ PrivateSum($v_t v_t^T$, $B^v$, $t$, $R^2$, $\frac{\epsilon}{2}$, $\frac{\delta}{2}$, $T$) (see Algorithm 5)
8: $(\hat{u}_t, B^u) \leftarrow$ PrivateSum($y_t v_t$, $B^u$, $t$, $R$, $\frac{\epsilon}{2}$, $\frac{\delta}{2}$, $T$) (see Algorithm 5)
9: QFTL Update: $\hat{x}_{t+1} \leftarrow (t\alpha I + \hat{V}_t)^{-1} \hat{u}_t$
10: Output $\hat{x}_{t+1}$
11: end for



Theorem 7 (PQFTL Privacy). Let $F$ be a sequence of quadratic functions, where $f_t(x; y_t, v_t) = \frac{1}{2}(y_t - v_t^T x)^2 + \frac{\alpha}{2}\|x\|_2^2$. Then, PQFTL (Algorithm 4) is $(\epsilon, \delta)$-differentially private.

Proof. Using Theorem 9 (stated in Section 3.5.1), both $\hat{V}_t$ and $\hat{u}_t$ are each $(\frac{\epsilon}{2}, \frac{\delta}{2})$-differentially private w.r.t. $v_t$ and $y_t$, $\forall t$, and hence w.r.t. the function sequence $F$. Now, $\hat{x}_{t+1}$ depends on $F$ only through $[\hat{V}_t, \hat{u}_t]$. Hence, the theorem follows using a standard composition argument [11, 10]. □

Next, we provide a regret bound analysis for our PQFTL algorithm.

Theorem 8 (PQFTL Regret). Let $F$ be a sequence of quadratic functions, where $f_t(x; y_t, v_t) = \frac{1}{2}(y_t - v_t^T x)^2 + \frac{\alpha}{2}\|x\|_2^2$. Let $R$ be the maximum $L_2$ norm of any $v_t$ and $|y_t|$. Then, the regret bound of PQFTL (Algorithm 4) satisfies (w.p. $\ge 1 - \exp(-d/2)$):

^pqftl(T) = 6 [^i^iog^rV 



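Proof. Using the definition of regret,

$$\begin{aligned}
\mathcal{R}_{PQFTL}(T) &= \sum_{t=1}^{T} f_t(\hat{x}_t) - \min_{x^*} \sum_{t=1}^{T} f_t(x^*) = \sum_{t=1}^{T} f_t(x_t) - \min_{x^*} \sum_{t=1}^{T} f_t(x^*) + \sum_{t=1}^{T}\left(f_t(\hat{x}_t) - f_t(x_t)\right) \\
&\le \mathcal{R}_{QFTL}(T) + \sum_{t=1}^{T}\left(f_t(\hat{x}_t) - f_t(x_t)\right) \\
&\le \frac{R^4(1 + 2R/\alpha)^2}{\alpha}\log T + \sum_{t=1}^{T}\left(f_t(\hat{x}_t) - f_t(x_t)\right), \quad (23)
\end{aligned}$$

where the last inequality follows using (22).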

Now, as $f_t(x)$ is a function with $(R + \alpha)$-Lipschitz continuous gradient,

$$\begin{aligned}
f_t(\hat{x}_t) - f_t(x_t) &\le \left((v_t^T x_t - y_t)v_t + \alpha x_t\right)^T(\hat{x}_t - x_t) + \frac{R + \alpha}{2}\|\hat{x}_t - x_t\|_2^2 \\
&\le R(2R^2/\alpha + R + 2)\|\hat{x}_t - x_t\|_2 + \frac{R + \alpha}{2}\|\hat{x}_t - x_t\|_2^2, \quad (24)
\end{aligned}$$

where the last inequality follows using the Cauchy-Schwarz inequality and the fact that $\|x_t\|_2 \le 2R/\alpha$.

We now bound $\|\hat{x}_{t+1} - x_{t+1}\|_2$. Let $\hat{V}_t = V_t + A_t$ and $\hat{u}_t = u_t + \beta_t$, where $A_t$ and $\beta_t$ are the noise additions introduced by the Private Sum algorithm (Algorithm 5).
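Now, from step 9 of PQFTL (Algorithm 4) we have,

$$(\hat{V}_t + t\alpha I)\hat{x}_{t+1} = \hat{u}_t \iff \left(\tfrac{1}{t}\hat{V}_t + \alpha I\right)\hat{x}_{t+1} = \tfrac{1}{t}\hat{u}_t. \quad (25)$$

Similarly, using the QFTL update (see (21)) we have,

$$\left(\tfrac{1}{t}V_t + \alpha I\right)x_{t+1} = \tfrac{1}{t}u_t. \quad (26)$$

Using (25) and (26):

$$\left(\tfrac{1}{t}\hat{V}_t + \alpha I\right)(\hat{x}_{t+1} - x_{t+1}) = \tfrac{1}{t}\beta_t - \tfrac{1}{t}A_t x_{t+1}. \quad (27)$$

Now, using $\hat{V}_t = V_t + A_t$ and the triangle inequality we have,

$$\left\|\left(\tfrac{1}{t}\hat{V}_t + \alpha I\right)(\hat{x}_{t+1} - x_{t+1})\right\|_2 \ge \left\|\left(\tfrac{1}{t}V_t + \alpha I\right)(\hat{x}_{t+1} - x_{t+1})\right\|_2 - \left\|\tfrac{1}{t}A_t(\hat{x}_{t+1} - x_{t+1})\right\|_2. \quad (28)$$

Furthermore,

$$\left\|\tfrac{1}{t}A_t(\hat{x}_{t+1} - x_{t+1})\right\|_2 \le \tfrac{1}{t}\|A_t\|_2\,\|\hat{x}_{t+1} - x_{t+1}\|_2. \quad (29)$$

Thus by combining (27), (28), (29) and using the fact that the smallest eigenvalue of $\left(\tfrac{1}{t}V_t + \alpha I\right)$ is lower-bounded by $\alpha$,

$$\tfrac{1}{t}\|\beta_t\|_2 + \tfrac{1}{t}\|A_t\|_2\,\|x_{t+1}\|_2 \ge \left|\alpha - \frac{\|A_t\|_2}{t}\right|\,\|\hat{x}_{t+1} - x_{t+1}\|_2. \quad (30)$$

Now using Theorem 9, each entry of the matrix $A_t$ is drawn from $\mathcal{N}(0, \sigma^2\log T)$ for $\sigma^2 = \frac{R^2}{\epsilon}\log^2 T \log\frac{\log T}{\delta}$. Thus the spectral norm of $A_t$, $\|A_t\|_2$, is bounded by $3\sigma\sqrt{d}$ with probability at least $1 - \exp(-d/2)$. Similarly, $\|\beta_t\|_2 \le 3\sigma\sqrt{d}$, with probability at least $1 - \exp(-d/2)$. Also, $\|x_t\|_2 \le 2R/\alpha$. Using the above observation with (30),

$$\|\hat{x}_{t+1} - x_{t+1}\|_2 \le \frac{\sigma\sqrt{d}}{t} \cdot \frac{3 + 6R/\alpha}{\left|\alpha - \frac{3\sigma\sqrt{d}}{t}\right|}. \quad (31)$$

Using (23), (24), and (31), we get (with probability at least $1 - \exp(-d/2)$):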



ftpQFTL(T) < fi4(1 + 2jR/a)2 logT+3^(2fi 2 /«+^+2)(l + 2^/a)(l+logT)^7loiriog (32) 
a ye V o 

Hence w.h.p., 

'n % log 1 



ftpQFTL(T) = 6 i ° 5 ^dbg 15 T 1 . 



□ 



3.5.1 Computing Partial Sums Privately 

In this section, we consider the problem of computing partial sums while preserving differential privacy. Formally, let $D = (w_1, w_2, \dots, w_T)$ be a sequence of vectors, where at each time step $t$, a vector $w_t \in \mathbb{R}^d$ is provided. Now the goal is to output partial sums $W_t = \sum_{\tau=1}^{t} w_\tau$ at each time step without compromising privacy of the data vectors in $D$. Note that by treating a matrix as a long vector obtained by row concatenation, we can use the same approach to compute partial sums over matrices as well.

Now, note that the $L_2$-sensitivity of each partial sum is $O(R)$ ($R = \max_t \|w_t\|_2$), as changing one $w_\tau$ can change a partial sum by an additive factor of $2R$. Hence, a naive method is to add $O\left(R\sqrt{\frac{\log(1/\delta)}{\epsilon}}\right)$ noise at the $t$-th step to obtain $(\epsilon, \delta)$-privacy for a fixed step $t$. Using a standard composition argument, the overall privacy of such a scheme over $T$ iterations
Figure 1: Binary Tree for $T = 8$. Each node in the tree has noise drawn from $\mathcal{N}(0, \sigma^2 I_d)$, including the leaves. The edge labels on the path from the root to any node form the label for that node. (a): $w_1, w_2, \dots, w_7$ are the input vectors that have arrived till time step $t = 7$. Each internal node is obtained by adding noise from $\mathcal{N}(0, \sigma^2 I_d)$ to the sum of input vectors in the sub-tree rooted at the node. To return the partial sum at $t = 7$, return the sum of the nodes in thick red. The dotted nodes are unpopulated. (b): The figure depicts the change in the data structure after the arrival of $w_8$. Now the partial sum at $t = 8$ is obtained by using just one node denoted in thick red.



would be $(T\epsilon, T\delta)$. Hence, to get a constant $(\epsilon', \delta')$ privacy, we would need to add $O\left(R\sqrt{T}\sqrt{\frac{\log(T/\delta')}{\epsilon'}}\right)$ noise. In contrast, our method, which is based on a generalization of [13], is able to provide the same level of privacy by adding only $O\left(R\log T\sqrt{\frac{\log(\log T/\delta)}{\epsilon}}\right)$ noise. We first provide a high level description of the algorithm and then provide a detailed privacy and utility analysis.

Following [13], we first create a binary tree $B$ where each leaf node corresponds to an input vector in $D$. We denote a node at level $i$ (root being at level 0) with strings in $\{0, 1\}^i$ in the following way: for a given node in level $i$ with label $s \in \{0, 1\}^i$, the left child of $s$ is denoted with the label $s \circ 0$ and the right child is denoted with $s \circ 1$. Here the operator $\circ$ denotes concatenation of strings. Also, the root is labeled with the empty string.

Now, each node $s$ in the tree $B$ contains two values: $B_s$ and $\hat{B}_s$, where $B_s$ is obtained by the summation of vectors in each of the leaves of the sub-tree rooted at $s$, i.e., $B_s = \sum_{j:\, j = s \circ r,\ r \in \{0,1\}^{k-i}} w_j$, where $k$ denotes the depth of the tree. Also, $\hat{B}_s = B_s + b_s$ is a perturbation of $B_s$, $b_s \sim \mathcal{N}(0, \sigma^2 I_d)$, and $\sigma$ is as given in Lemma 6.

A node in the tree is populated only when all the vectors that form the leaves of the sub-tree rooted at the node have arrived. Hence, at time instant $t$ we receive vector $w_t$ and populate the nodes in the tree $B$ for which all the leaves in the sub-tree rooted at them have arrived. To populate a node labeled $s$, we compute $B_s = B_{s \circ 0} + B_{s \circ 1}$, the sum of the corresponding values at its two children in the tree, and also $\hat{B}_s = B_s + b_s$, $b_s \sim \mathcal{N}(0, \sigma^2 I_d)$.

As we prove below in Lemma 6, for an $i$-th level node which is populated and has label $s \in \{0, 1\}^i$, $\hat{B}_s$ contains an $(\epsilon, \delta)$-private sum of the $2^{k-i}$ vectors that correspond to the leaves of the sub-tree rooted at $s$. Now, to output a differentially private partial sum at time step $t$, we add up the perturbed values at the highest possible nodes that can be used to compute the sum. Note that such a summation would have at most one node at each level. See Figure 1 for an illustration. We provide pseudo-code of our method in Algorithm 5.

Theorem 9 states privacy as well as utility guarantees of our partial sums method (Algorithm 5). We first provide a technical lemma which we later use in our proof of Theorem 9.

Let $\hat{B}(D)$ denote the set of all perturbed node values $\hat{B}_s$, $\forall s$, obtained by applying Algorithm 5 on dataset $D$. Also, let $D$ and $D'$ be two datasets that differ in at most one entry, say $w_t$.

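Lemma 6. Let $\hat{B}_s(D) = B_s(D) + b_s$, where $b_s \sim \mathcal{N}(0, \sigma^2 I_d)$ for $\sigma^2 = \frac{R^2}{\epsilon}\log^2 T \log\frac{\log T}{\delta}$. Then, for any $s$ and any $\Theta_s \in \mathbb{R}^d$,

$$\mathrm{pdf}[\hat{B}_s(D) = \Theta_s] \le e^{\frac{\epsilon}{\log T}}\,\mathrm{pdf}[\hat{B}_s(D') = \Theta_s] + \frac{\delta}{\log T},$$

where $D$ and $D'$ are two datasets differing in exactly one entry.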
Algorithm 5 Private Sum($w_t$, $B$, $t$, $R$, $\epsilon$, $\delta$, $T$)
Require: Data vector $w_t$, current binary tree $B$, current vector number $t$, $R$ a bound on $\|w_t\|_2$, privacy parameters $\epsilon$ and $\delta$, total number of vectors $T$, dimensionality of vectors $d$
1: if $t = 1$ then
2: Initialize the binary tree $B$ over $T$ leaves with all nodes
3: $\sigma^2 \leftarrow \frac{R^2}{\epsilon}\log^2 T \log\frac{\log T}{\delta}$
4: end if
5: $s_t \leftarrow$ the string representation of $t$ in binary
6: $B_{s_t} \leftarrow w_t$ // Populate the $s_t$-th entry of $B$
7: $\hat{B}_{s_t} \leftarrow B_{s_t} + b_{s_t}$, where $b_{s_t} \sim \mathcal{N}(0, \sigma^2 I_d)$
8: Let $S_t$ be the set of all ancestors $s$ of $s_t$ in the tree $B$, such that all the leaves in the sub-tree rooted at $s$ are already populated
9: for all $s \in S_t$ do
10: $B_s \leftarrow B_{s \circ 0} + B_{s \circ 1}$ // $B_s$ is the value at node with label $s$ (without noise)
11: $\hat{B}_s \leftarrow B_s + b_s$, where $b_s \sim \mathcal{N}(0, \sigma^2 I_d)$ // $\hat{B}_s$ is the noisy value at node with label $s$
12: end for
13: Find the minimum set of already populated nodes in $B$ that can compute $\sum_{\tau=1}^{t} w_\tau$. Formally, starting from the left, for each bit position $i$ in $s_t$ such that $s_t(i) = 1$, form strings $s^q = s_t(1) \circ \dots \circ s_t(i-1) \circ 0$ of length $i$. Let $s^1, s^2, \dots, s^Q$ be all such strings, where $Q \le \log T$. For example, if $s_t = 110$ then the strings obtained this way are: $0$ and $10$
14: Output: $(\hat{W}_t = \sum_{q=1}^{Q} \hat{B}_{s^q}, B)$
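The tree-based partial-sum computation can be sketched as a small streaming class. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles scalars instead of vectors, indexes nodes by `(height, index)` with leaves at height 0 rather than by bit strings, takes the noise standard deviation `sigma` as a caller-supplied parameter instead of deriving it from $(\epsilon, \delta)$, and the class name `PrivateSum` with its `add` method is hypothetical.

```python
import random

class PrivateSum:
    """Streaming tree-based partial sums with Gaussian noise at every node."""

    def __init__(self, sigma, rng=None):
        self.sigma = sigma
        self.rng = rng or random.Random(0)
        self.exact = {}   # (height, index) -> exact subtree sum B_s
        self.noisy = {}   # (height, index) -> perturbed value B_s + b_s
        self.t = 0

    def _populate(self, key, value):
        self.exact[key] = value
        self.noisy[key] = value + self.rng.gauss(0.0, self.sigma)

    def add(self, w):
        """Insert the next value and return (noisy prefix sum, nodes used)."""
        leaf = self.t          # 0-based leaf position of the new value
        self.t += 1
        self._populate((0, leaf), w)
        h, idx = 0, leaf
        # A right child completes its parent: populate ancestors bottom-up.
        while idx % 2 == 1:
            left, right = self.exact[(h, idx - 1)], self.exact[(h, idx)]
            h, idx = h + 1, idx // 2
            self._populate((h, idx), left + right)
        # Cover leaves [0, t) with the highest populated nodes,
        # one node per set bit in the binary representation of t.
        total, used, start = 0.0, 0, 0
        for height in reversed(range(self.t.bit_length())):
            if (self.t >> height) & 1:
                total += self.noisy[(height, start >> height)]
                start += 1 << height
                used += 1
        return total, used

if __name__ == "__main__":
    ps = PrivateSum(sigma=1.0)
    for w in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]:
        print(ps.add(w))
```

For eight inputs, the prefix at $t = 7$ is assembled from three nodes and the prefix at $t = 8$ from a single node, matching the two panels of Figure 1. Each input also appears in only logarithmically many nodes, which is what keeps the per-node noise small.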



Proof. Let $\Delta = B_s(D) - B_s(D')$. Note that $\|\Delta\|_2 \le R$. Now, consider the following ratio:

pdf[S s ( J D) = 9 s ] _ exp H e -- 2 y i _ ||A||2-2A r (S s ( J D')-0, 



< exp ^2 • ( 33 ) 

Now, $\Delta^T(B_s(D') - \Theta_s)$ follows $\mathcal{N}(0, \|\Delta\|_2^2\sigma^2)$. For a random variable $V \sim \mathcal{N}(0, 1)$, and for all $\gamma > 1$, $\Pr[|V| \ge \gamma] \le e^{-\gamma^2/2}$ (Mill's inequality). Thus,

$$\Pr\left[|\Delta^T(B_s(D') - \Theta_s)| \ge R\sigma\gamma\right] \le \Pr\left[|\Delta^T(B_s(D') - \Theta_s)| \ge \|\Delta\|_2\sigma\gamma\right] \le \exp\left(-\frac{\gamma^2}{2}\right).$$

The lemma follows by setting $\gamma = 2\sqrt{\ln\frac{\log T}{\delta}}$ in the equation above and combining it with (33). □

Next, we provide formal privacy and utility guarantees for Algorithm 5. Our proof is inspired by a technique developed by [13].

Theorem 9 (Algorithm 5: Privacy and Utility). Let $D = (w_1, \dots, w_T)$ be a dataset of vectors with $w_t \in \mathbb{R}^d$ being provided online at each time step $t$. Let $R = \max_{i \le T}\|w_i\|_2$ and $\sigma^2 = \frac{R^2}{\epsilon}\log^2 T \log\frac{\log T}{\delta}$. Let $W_t = \sum_{\tau=1}^{t} w_\tau$ be the partial sum of the entries in the dataset $D$ till the $t$-th entry. Then, $\forall t \in [T]$, the following are true for the output of Algorithm 5 with parameters $(t, \epsilon, \delta, R, T)$.

• Privacy: The output $\hat{W}_t$ is $(\epsilon, \delta)$-differentially private.

• Utility: The output $\hat{W}_t$ has the following distribution: $\hat{W}_t \sim \mathcal{N}(W_t, k\sigma^2 I_d)$, where $k \le \lceil \log T \rceil$.

Proof. Utility: Note that Line 14 of Algorithm 5 adds at most $\lceil \log T \rceil$ vectors $\hat{B}_s$ (corresponding to the chosen nodes of the binary tree $B$). Now each of the selected vectors $\hat{B}_s$ is generated by adding a noise $b_s \sim \mathcal{N}(0, \sigma^2 I_d)$. Furthermore, each $b_s$ is generated independent of other noise vectors. Hence, the total noise in the output partial sum $\hat{W}_t$ has the following distribution: $\mathcal{N}(0, k\sigma^2 I_d)$, where $k \le \lceil \log T \rceil$.

Privacy: First, we prove that $\hat{B}(D)$ is $(\epsilon, \delta)$-differentially private. As defined above, let $D$ and $D'$ be the two datasets (sequences of input vectors) that differ in exactly one entry. Let $S \subset \mathbb{R}^{2T-1}$. Now,

$$\frac{\Pr[\hat{B}(D) \in S]}{\Pr[\hat{B}(D') \in S]} = \frac{\int_{\Theta \in S} \mathrm{pdf}[\hat{B}(D) = \Theta]}{\int_{\Theta \in S} \mathrm{pdf}[\hat{B}(D') = \Theta]}.$$

Note that noise ($b_s$) at each node $s$ is generated independently of all the other nodes. Hence,

$$\frac{\mathrm{pdf}[\hat{B}(D) = \Theta]}{\mathrm{pdf}[\hat{B}(D') = \Theta]} = \frac{\prod_s \mathrm{pdf}[\hat{B}_s(D) = \Theta_s]}{\prod_s \mathrm{pdf}[\hat{B}_s(D') = \Theta_s]}.$$

Since $D$ and $D'$ differ in exactly one entry, $\hat{B}(D)$ and $\hat{B}(D')$ can differ in at most $\log T$ nodes. Thus at most $\log T$ ratios in the above product can be different from one. Now, by using Lemma 6 to bound each of these ratios and then using a composability argument [11, 10] over the $\log T$ nodes which have differing values in $\hat{B}(D)$ and $\hat{B}(D')$,

$$\Pr[\hat{B}(D) \in S] \le e^{\epsilon}\Pr[\hat{B}(D') \in S] + \delta,$$

i.e., $\hat{B}(D)$ is $(\epsilon, \delta)$-differentially private.

Now, each partial sum is just a deterministic function of $\hat{B}(D)$. Hence, $(\epsilon, \delta)$-differential privacy of each partial sum follows directly from $(\epsilon, \delta)$-differential privacy of $\hat{B}(D)$. □



4 Discussion 

4.1 Other Differentially Private Algorithms 

Recall that in Section 3.3 we described our Private IGD algorithm that achieves $\tilde{O}(\sqrt{T})$ regret for any sequence of strongly convex, Lipschitz continuous functions. While this class of functions is reasonably broad, we can further drop the strong convexity condition as well, albeit with higher regret. To this end, we perturb each $f_t$ and apply IGD over $\tilde{f}_t = f_t + \|x - x_0\|_2^2$, where $x_0$ is a randomly picked point from the convex set $\mathcal{C}$. We can then show that under this perturbation "trick" we can obtain sub-linear regret of $\tilde{O}(T^{3/4})$. The analysis is similar to our analysis for IGD and requires a fairly straightforward modification of the regret analysis by [27].

We now briefly discuss our observations about the Exponentially Weighted Online Optimization (EWOO) ETTl . 
another OCP algorithm with sub-linear regret bound. This algorithm does not directly fit into our Private OCP 
framework, and is not wide-spread in practice due to relatively inefficient updates (see ET1 for a detailed discussion). 
However, just for completeness, we note that by using techniques similar to our Private OCP framework and using 
exponential mechanism (see QUI ), one can analyze this algorithm as well to guarantee differential privacy along with 
d(Vr) regret. 

4.2 Application to Offline Learning 

In Section 3 we proposed a generic online learning framework that can be used to obtain differentially private online learning algorithms with good regret bounds. Recently, [23] showed that online learning algorithms with good regret bounds can be used to solve several offline learning problems as well. In this section, we exploit this connection to provide a generic differentially private framework for a large class of offline learning problems as well.

In a related work, Q also proposed a method to obtain differentially private algorithms for offline learning prob- 
lems. However, as discussed later in the section, our method covers a wider range of learning problems, is more 
practical and obtains better error bounds for the same level of privacy. 

First, we describe the standard offline learning model that we use. In typical offline learning scenarios, one receives 
(or observes) a set of training points sampled from some fixed distribution and also a loss function parametrized by 
Algorithm 6 Private Offline Learning (POL)
1: Input: Input dataset $D = (z_1, \dots, z_T)$ and the convex set $\mathcal{C}$
2: Parameter: Privacy parameters $(\epsilon_p, \delta)$, generalization error parameter $\epsilon_g$, Lipschitz bound $L$ on the loss function $\ell$, bound on $\|x^*\|_2$
3: If $\mathcal{C} = \mathbb{R}^d$ then set $\mathcal{C} = \{x : x \in \mathbb{R}^d, \|x\|_2 \le \|x^*\|_2\}$.
4: Choose $x_1$ randomly from $\mathcal{C}$

5: Set a <- 

\\ x \\2 

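6: Initialize $s = x_1$
7: for $t = 1$ to $T - 1$ do
8: Learning rate: $\eta_t = \frac{1}{\alpha t}$
9: IGD Update: $x_{t+1} \leftarrow \arg\min_{x \in \mathcal{C}} \left(\frac{1}{2}\|x - x_t\|_2^2 + \eta_t\left(\ell(x; z_t) + \frac{\alpha}{2}\|x\|_2^2\right)\right)$
10: Store sum: $s \leftarrow s + x_{t+1}$
11: end for
12: Average: $\bar{x} \leftarrow \frac{s}{T}$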

13: Noise Addition: x i + b, where b ~ Af(0 d , /3 2 I d ) and /3 = 2V ^ {L+ ^ h)lnT ^Jln^ + e p 
14: Output A = argmhXj.gc — 



some hidden parameters. Now, the goal is to learn the hidden parameters such that the expected loss over the same 
distribution is minimized. 

Formally, consider a domain $Z$ and an arbitrary distribution $\mathcal{D}_Z$ over $Z$ from which training data is generated. Let
$D = (z_1, \dots, z_T)$ be a training dataset, where each $z_i$ is drawn i.i.d. from the distribution $\mathcal{D}_Z$. Also, consider a loss
function $\ell : C \times Z \to \mathbb{R}^+$, where $C \subseteq \mathbb{R}^d$ is a (potentially unbounded) convex set. Let $\ell(\cdot\,; \cdot)$ be a convex function,
$L$-Lipschitz in both the parameters, and let $\ell(0; z) \le 1$, $\forall z \in Z$. Intuitively, the loss function specifies the goodness of a
learned model $x \in C$ w.r.t. the training data. Hence, the goal is to solve the following minimization problem (also
called Risk Minimization):

$\min_{x \in C} \mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(x; z)]$. (34)

Let $x^*$ be the optimal solution to (34), i.e., $x^* = \arg\min_{x \in C} \mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(x; z)]$. Recently, [23] provided an algorithm
to obtain an additive approximation to (34) via online convex programming (OCP). The algorithm of [23] is as
follows: execute any reasonable OCP algorithm $\mathcal{A}$ (like IGD or GIGA) on the function sequence
$F = \left( \ell(x; z_1) + \frac{\alpha}{2}\|x\|_2^2, \dots, \ell(x; z_T) + \frac{\alpha}{2}\|x\|_2^2 \right)$ in an online fashion. Furthermore, if the set $C$ is unbounded, then it can be set
to be an $L_2$ ball of radius $\|x^*\|_2$, i.e.,

$C = \{x : x \in \mathbb{R}^d, \|x\|_2 \le \|x^*\|_2\}$.

Now, let $x_1, \dots, x_T$ be the sequence of outputs produced by $\mathcal{A}$. Then, output $\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t$ as an approximation
for $x^*$. Theorem 11 bounds the additional error incurred by $\bar{x}$ in comparison to $x^*$. Next, to produce a differentially private
output we can add appropriate noise to the output $\bar{x}$. We present detailed pseudo-code in Algorithm 6. For simplicity
of presentation, we instantiate our framework with the IGD algorithm as the underlying OCP algorithm.
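As a minimal sketch of the final steps of Algorithm 6 (averaging, noise addition, and projection), the following Python code may help fix ideas. It assumes the noise scale reads $\beta = \frac{2\sqrt{2}(L + \epsilon_g/\|x^*\|_2)\ln T}{T\epsilon_p}\sqrt{\ln\frac{1}{\delta} + \epsilon_p}$ and that $C$ is the $L_2$ ball of radius $\|x^*\|_2$. The function names are hypothetical, and the iterates are assumed to come from any OCP routine such as IGD.

```python
import math
import random


def pol_noise_scale(L, eps_g, x_star_norm, T, eps_p, delta):
    """Gaussian noise std beta for POL (assumed reading of Step 13)."""
    # L' = L + alpha * ||x*||_2 with alpha = eps_g / ||x*||_2^2
    L_prime = L + eps_g / x_star_norm
    sensitivity_part = 2.0 * math.sqrt(2.0) * L_prime * math.log(T) / (T * eps_p)
    return sensitivity_part * math.sqrt(math.log(1.0 / delta) + eps_p)


def project_l2_ball(x, radius):
    """Euclidean projection of x onto {v : ||v||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in x))
    if norm <= radius:
        return list(x)
    return [v * radius / norm for v in x]


def privatize_average(iterates, beta, radius, rng=None):
    """Average the iterates, add N(0, beta^2 I) noise, project back onto the ball."""
    rng = rng or random.Random()
    T = len(iterates)
    d = len(iterates[0])
    x_bar = [sum(x[i] for x in iterates) / T for i in range(d)]
    x_tilde = [x_bar[i] + rng.gauss(0.0, beta) for i in range(d)]
    return project_l2_ball(x_tilde, radius)


if __name__ == "__main__":
    beta = pol_noise_scale(L=1.0, eps_g=0.1, x_star_norm=1.0, T=1000, eps_p=1.0, delta=0.01)
    print(beta)
    print(privatize_average([[0.5, 0.5], [0.7, 0.1]], beta, radius=1.0, rng=random.Random(0)))
```

Note that the noise scale shrinks roughly as $\ln T / T$, so the perturbation becomes negligible relative to the averaged iterate as the dataset grows.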
First, we show that POL (Algorithm 6) is $(\epsilon_p, \delta)$-differentially private.

Theorem 10 (POL Privacy). The Private Offline Learning (POL) algorithm (see Algorithm 6) is $(\epsilon_p, \delta)$-differentially
private.

Proof. Recall that to prove differential privacy, one needs to show that changing one training point in the dataset
$D$ will not lead to significant changes in our algorithm's output $\tilde{x}$, which is a perturbation of $\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t$. Hence,
we need to show that the $L_2$-sensitivity (see Definition 3) of $\bar{x}$ is low.

Now let $x'_1, \dots, x'_T$ be the sequence of outputs produced by the IGD algorithm used in Algorithm 6 when executed
on a dataset $D'$ which differs in exactly one entry from $D$. To estimate the sensitivity of $\bar{x}$, we need to bound
$\frac{1}{T}\left\|\sum_{t=1}^{T}(x_t - x'_t)\right\|_2$. Now, using the triangle inequality and the IGD sensitivity lemma, we get:

$\frac{1}{T}\left\|\sum_{t=1}^{T}(x_t - x'_t)\right\|_2 \le \frac{1}{T}\sum_{t=1}^{T}\|x_t - x'_t\|_2 \le \frac{1}{T}\sum_{t=2}^{T}\frac{2L'}{t} \le \frac{2L' \ln T}{T}$, (35)

where $L'$ is the maximum Lipschitz continuity coefficient of $\ell(x; z_t) + \frac{\alpha}{2}\|x\|_2^2$, $\forall t$, over the set $C$. Using the fact that
$\|x\|_2 \le \|x^*\|_2$ for all $x \in C$, we obtain $L' = L + \alpha\|x^*\|_2$.

The theorem now follows using the $L_2$-sensitivity of $\bar{x}$ (see (35)) and a proof similar to that of the earlier privacy lemma. □
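The last inequality in (35) can be sanity-checked numerically. The sketch below assumes a per-iterate sensitivity bound of the form $\|x_t - x'_t\|_2 \le 2L'/t$ for $t \ge 2$ (with $x_1 = x'_1$, since the initial point does not depend on the data); the helper names are hypothetical.

```python
import math


def averaged_sensitivity_sum(L_prime, T):
    """(1/T) * sum_{t=2}^{T} 2L'/t, the triangle-inequality bound on the averaged iterate."""
    return sum(2.0 * L_prime / t for t in range(2, T + 1)) / T


def closed_form_sensitivity(L_prime, T):
    """The closed-form upper bound 2 L' ln(T) / T."""
    return 2.0 * L_prime * math.log(T) / T


if __name__ == "__main__":
    for T in (2, 10, 1000, 100000):
        # The summed bound never exceeds the closed form because
        # sum_{t=2}^{T} 1/t <= ln T.
        print(T, averaged_sensitivity_sum(1.0, T), closed_form_sensitivity(1.0, T))
```

The check relies only on the harmonic-sum inequality $\sum_{t=2}^{T} 1/t \le \ln T$, so it holds for any positive $L'$.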



With the privacy guarantee in place, we now focus on the utility of Algorithm 6, i.e., the approximation error for the
Risk Minimization problem (34). We first restate the approximation error incurred by $\bar{x} = \frac{1}{T}\sum_{t=1}^{T} x_t$, as derived by
[23].



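Theorem 11 (Approximation Error in Risk Minimization (Eq. 34) [23]). Let $\mathcal{R}_{\mathcal{A}}(T)$ be the regret for the online
algorithm $\mathcal{A}$. Then with probability at least $1 - \gamma$,

$\mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(\bar{x}; z)] - \mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(x^*; z)] \le \frac{\mathcal{R}_{\mathcal{A}}(T)}{T} + \frac{4}{T}\sqrt{\frac{L'^2\, \mathcal{R}_{\mathcal{A}}(T) \ln\left(\frac{4\ln T}{\gamma}\right)}{\alpha}} + \frac{\max\left\{\frac{16 L'^2}{\alpha}, 6\right\} \ln\left(\frac{4\ln T}{\gamma}\right)}{T} + \frac{\alpha}{2}\|x^*\|_2^2$,

where $L' = L + \alpha\|x^*\|_2$, $L$ is the Lipschitz continuity bound on the loss function $\ell$, and $\alpha$ is the strong convexity
parameter of the function sequence $F$.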

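Theorem 12 (POL Utility (Approximation Error in Eq. 34)). Let $L$ be the Lipschitz bound on the loss function $\ell$ and $T$
be the total number of points in the training dataset $D = \{z_1, \dots, z_T\}$. Let $(\epsilon_p, \delta)$ be the differential privacy parameters,
and $d$ be the dimensionality. Then, with probability at least $1 - \gamma$,

$\mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(\tilde{x}; z)] - \min_{x \in C} \mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(x; z)] \le \epsilon_g$,

when the number of points sampled ($T$) satisfies

$T \ge C \max\left( \frac{\sqrt{d}\, L\, (L + \epsilon_g/\|x^*\|_2) \sqrt{\ln\frac{1}{\gamma} \ln\frac{1}{\delta}}}{\epsilon_g \epsilon_p},\ \frac{(L + \epsilon_g/\|x^*\|_2)^2 \|x^*\|_2^2 \ln T \ln\frac{1}{\gamma}}{\epsilon_g^2} \right)$,

where $C > 0$ is a global constant.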
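Because $\ln T$ appears on the right-hand side, the sample-size condition of Theorem 12 is implicit in $T$. A hedged sketch of how one might test a candidate $T$ is given below. It assumes $C = 1$ (the theorem only states that $C$ is a global constant), assumes the two terms read $\sqrt{d} L L' \sqrt{\ln(1/\gamma)\ln(1/\delta)}/(\epsilon_g \epsilon_p)$ and $L'^2 \|x^*\|_2^2 \ln T \ln(1/\gamma)/\epsilon_g^2$ with $L' = L + \epsilon_g/\|x^*\|_2$, and uses a hypothetical function name.

```python
import math


def pol_sample_size_ok(T, d, L, eps_g, eps_p, delta, gamma, x_star_norm, C=1.0):
    """Return True if T satisfies the (assumed) Theorem 12 condition with constant C."""
    L_prime = L + eps_g / x_star_norm
    log_gamma = math.log(1.0 / gamma)
    log_delta = math.log(1.0 / delta)
    # Term driven by the privacy noise (dimension enters as sqrt(d)).
    privacy_term = math.sqrt(d) * L * L_prime * math.sqrt(log_gamma * log_delta) / (eps_g * eps_p)
    # Term driven by the generalization error of the averaged online iterate.
    generalization_term = L_prime ** 2 * x_star_norm ** 2 * math.log(T) * log_gamma / eps_g ** 2
    return T >= C * max(privacy_term, generalization_term)


if __name__ == "__main__":
    params = dict(d=10, L=1.0, eps_g=0.1, eps_p=1.0, delta=0.01, gamma=0.05, x_star_norm=1.0)
    for T in (100, 1000, 100000):
        print(T, pol_sample_size_ok(T, **params))
```

In this illustrative setting the generalization term dominates, so tightening $\epsilon_g$ raises the required $T$ much faster than tightening $\epsilon_p$ does.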

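Proof. To prove the result, we upper bound $\mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(\tilde{x}; z)] - \mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(x^*; z)]$ as:

$\mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(\tilde{x}; z)] - \mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(x^*; z)] = \mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(\tilde{x}; z) - \ell(\bar{x}; z)] + \mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(\bar{x}; z) - \ell(x^*; z)]$
$\le L\|\tilde{x} - \bar{x}\|_2 + \mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(\bar{x}; z) - \ell(x^*; z)]$
$= L\|b\|_2 + \mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(\bar{x}; z) - \ell(x^*; z)]$, (36)

where the inequality follows using the Lipschitz continuity of $\ell$ and the last equality follows by the noise addition
step (Step 13) of Algorithm 6.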

From the tail bound on the norm of a Gaussian random vector, it follows that with probability at least $1 - \gamma/2$,

$\|b\|_2 \le 3\sqrt{d}\,\beta\sqrt{\ln\frac{1}{\gamma}} \le 12\sqrt{d}\,L'\,\frac{\ln T}{T\epsilon_p}\sqrt{\ln\frac{1}{\gamma}\ln\frac{1}{\delta}}$, (37)

where $L' = L + \epsilon_g/\|x^*\|_2$ and $L$ is the Lipschitz continuity parameter of $\ell$. Note that in Line 5 of Algorithm 6 we set the
strong convexity parameter $\alpha = \frac{\epsilon_g}{\|x^*\|_2^2}$.
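The first inequality in (37) is a standard Gaussian norm tail bound, and it can be illustrated with a small Monte Carlo experiment. The sketch below is only an empirical illustration under assumed values of $d$, $\beta$, and $\gamma$ (with $\gamma$ small enough that $\ln\frac{1}{\gamma}$ is not close to zero); the function name is hypothetical.

```python
import math
import random


def gaussian_norm_exceed_rate(d, beta, gamma, trials=2000, seed=0):
    """Fraction of draws b ~ N(0, beta^2 I_d) with ||b||_2 > 3*sqrt(d)*beta*sqrt(ln(1/gamma))."""
    rng = random.Random(seed)
    threshold = 3.0 * math.sqrt(d) * beta * math.sqrt(math.log(1.0 / gamma))
    exceed = 0
    for _ in range(trials):
        norm = math.sqrt(sum(rng.gauss(0.0, beta) ** 2 for _ in range(d)))
        if norm > threshold:
            exceed += 1
    return exceed / trials


if __name__ == "__main__":
    # The observed exceedance rate should stay below the failure probability gamma.
    print(gaussian_norm_exceed_rate(d=10, beta=0.5, gamma=0.1))
```

Since $\|b\|_2$ concentrates around $\beta\sqrt{d}$, the threshold in (37) is conservative for small $\gamma$.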

Now, the regret bound of IGD is given by:

$\mathcal{R}_{IGD}(T) = O\left(\epsilon_g + \frac{L'^2}{\alpha}\ln T\right)$. (38)
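Thus, by combining (36), (37), (38), and Theorem 11, with probability at least $1 - \gamma$,

$\mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(\tilde{x}; z)] - \min_{x \in C} \mathbb{E}_{z \sim \mathcal{D}_Z}[\ell(x; z)] \le \frac{\epsilon_g}{2} + C\left( \frac{\sqrt{d}\, L\, (L + \frac{\epsilon_g}{\|x^*\|_2}) \ln T \sqrt{\ln\frac{1}{\gamma}\ln\frac{1}{\delta}}}{\epsilon_p T} + \frac{(L + \frac{\epsilon_g}{\|x^*\|_2})^2 \|x^*\|_2^2 \ln T \ln\frac{1}{\gamma}}{\epsilon_g T} \right)$,

where $C > 0$ is a global constant.

The result now follows by bounding the RHS above by $\epsilon_g$. □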



We note that although Algorithm 6 and its analysis assume that the underlying OCP algorithm is IGD, both can be
easily adapted to any other OCP algorithm by plugging in the regret bound and the $L_2$-sensitivity of the
corresponding OCP algorithm.

Comparison to existing differentially private offline learning methods: Recently, [3] proposed two differentially
private frameworks for a wide range of offline learning problems, namely, output perturbation and objective
perturbation. However, our method has three significant advantages over both the methods of [3]:

• Handles a larger class of learning problems: Note that both the privacy analysis (Theorem 10) and the utility analysis
(Theorem 12) only require the loss function $\ell$ to be a convex, Lipschitz continuous function. In fact, the loss
function is not even required to be differentiable. Hence, our method can handle hinge loss, a popular loss
function used by Support Vector Machines (SVMs). In comparison, [3] requires the loss function $\ell$ to be twice
differentiable and, furthermore, its gradient to be Lipschitz continuous.

Furthermore, our method can be used for minimizing risk (see (34)) over any fixed convex constraint set $C$. In
contrast, [3] requires the set $C$ to be the complete vector space $\mathbb{R}^d$.

• Better error bound: Theorem 18 of [3] bounds the sample size $T$ by an expression that is the same
as our bound (see Theorem 12) except for an additional $\sqrt{d}$ factor. Hence, our analysis provides a tighter error
bound w.r.t. the dimensionality of the space. We believe the difference is primarily due to our usage of Gaussian
noise instead of the Gamma noise added by [3].

• More practical: Our method provides an explicit iterative method for solving (34) and hence provides differential
privacy guarantees even if the algorithm stops at any step $T$. In contrast, [3] assumes an optimal solution to
a certain optimization problem, and it is not clear how the differential privacy guarantees of [3] extend when
the optimization algorithm is forced to halt prematurely and hence might not give the optimal solution.

In a related work, [33] also proposed a differentially private framework for offline learning. However, [33] compares
the point-wise convergence of the obtained solution $x$ to the private optimum of true risk minimizer $x^*$, whereas [3]
and our method (see Algorithm 6) compare the approximation error; hence, the results of [33] are incomparable to our
results.

5 Empirical Results 

In this section we study the privacy and utility (regret) trade-offs for two of our private OCP approaches under
different practical settings. Specifically, we consider the practically important problems of online linear regression
and online logistic regression. For online linear regression we apply our PQFTL approach and
for online logistic regression we apply our PIGD method (see Algorithm 2). For both problems, we compare
our method against the offline optimal and the non-private online version and show the regret/accuracy trade-off with
privacy. We show that our methods learn a meaningful hypothesis (a hyperplane for both problems) while privacy
is provably preserved due to our differential privacy guarantees.

5.1 Online Linear Regression (OLR) 

Online linear regression (OLR) requires solving for $x_t$ at each step so that the squared error in the prediction is minimized.
Specifically, we need to find $x_t$ in an online fashion such that $\sum_t (y_t - g_t^T x_t)^2 + \alpha\|x_t\|_2^2$ is minimized. OLR is a
practically important learning problem and has a variety of practical applications in domains such as finance [26].
[Figure 2, panels (a) and (b): "Online Linear Regression". Average regret (log scale) vs. number of iterations for
non-private FTL and for private FTL with $\epsilon \in \{10, 1, 0.1, 0.01\}$ and $\delta = 0.01$.]

(c)
Method                                      Accuracy
Non-private IGD                             68.1%
PIGD ($\epsilon = 20$, $\delta = 0.01$)     66.3%
PIGD ($\epsilon = 10$, $\delta = 0.01$)     62.7%
PIGD ($\epsilon = 1$, $\delta = 0.01$)      59.4%
PIGD ($\epsilon = 0.1$, $\delta = 0.01$)    58.3%

Figure 2: Privacy vs. Regret. (a), (b): Average regret (normalized by the number of iterations) incurred by FTL and
PQFTL with different levels of privacy $\epsilon$ on the synthetic 10-dimensional data and the Year Prediction data. Note that
the regret is plotted on a log scale. PQFTL obtained regret of the order of 1e-2 even with a high privacy level of
$\epsilon = 0.01$. (c): Classification accuracy obtained by the IGD and PIGD algorithms on the Forest cover-type dataset. PIGD
learns a meaningful classifier while providing privacy guarantees, especially for low privacy levels, i.e., high $\epsilon$.



Now, note that we can directly apply our PQFTL approach (see Section 3.3) to this problem to obtain differentially
private iterates $x_t$ with the regret guaranteed to be logarithmic. Here, we apply our PQFTL algorithm for the OLR
problem on a synthetic dataset as well as a benchmark real-world dataset, namely "Year Prediction" [15]. For the
synthetic dataset, we fix $x^*$, generate data points $g_t$ of dimensionality $d = 10$ by sampling a multivariate Gaussian
distribution, and obtain the target $y_t = g_t^T x^* + \eta$, where $\eta$ is random Gaussian noise with variance 0.01.
We generate $T = 100{,}000$ such input points and targets. The Year Prediction dataset is 90-dimensional and contains
around 500,000 data points. For both datasets, we set $\alpha = 1$ and at each step apply our PQFTL algorithm.
We measure the optimal offline solution using standard ridge regression and also compute the regret obtained by the
non-private FTL algorithm.
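The non-private baseline of this experiment can be sketched in a few lines. The code below is an illustration under stated assumptions, not the experimental code used for Figure 2: it uses $d = 3$ instead of 10, a hypothetical fixed $x^*$, noise standard deviation 0.1 (i.e., variance 0.01), a single ridge term $\alpha\|x\|_2^2$ in the FTL objective, and squared prediction error as the per-round loss. The offline comparator is the ridge solution on all $T$ points, and the function names are hypothetical.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(M, v):
    return [dot(row, v) for row in M]


def ftl_average_regret(T, alpha=1.0, noise_std=0.1, seed=0):
    """Average regret of non-private ridge-regularized FTL on synthetic linear data."""
    rng = random.Random(seed)
    x_star = [1.0, -2.0, 0.5]          # hypothetical hidden parameter vector
    d = len(x_star)
    # A_inv tracks (alpha*I + sum_t g_t g_t^T)^{-1}; b tracks sum_t y_t g_t.
    A_inv = [[(1.0 / alpha if i == j else 0.0) for j in range(d)] for i in range(d)]
    b = [0.0] * d
    x = [0.0] * d                      # first prediction, made before seeing any data
    data, online_loss = [], 0.0
    for _ in range(T):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = dot(g, x_star) + rng.gauss(0.0, noise_std)
        online_loss += (y - dot(g, x)) ** 2      # loss of the current iterate
        data.append((g, y))
        # Sherman-Morrison rank-one update of the inverse, then the new leader.
        Ag = mat_vec(A_inv, g)
        denom = 1.0 + dot(g, Ag)
        A_inv = [[A_inv[i][j] - Ag[i] * Ag[j] / denom for j in range(d)] for i in range(d)]
        b = [bi + y * gi for bi, gi in zip(b, g)]
        x = mat_vec(A_inv, b)
    # After the loop, x is the ridge solution on all T points (offline comparator).
    offline_loss = sum((y - dot(g, x)) ** 2 for g, y in data)
    return (online_loss - offline_loss) / T


if __name__ == "__main__":
    for T in (20, 400, 4000):
        print(T, ftl_average_regret(T))
```

The Sherman-Morrison update keeps each round at $O(d^2)$ cost, which matters when the leader must be recomputed at every step.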

Figure 2 (a) and (b) show the average regret (i.e., regret normalized by the number of entries $T$) incurred by
PQFTL for different privacy levels $\epsilon$ on the synthetic and Year Prediction data. Note that the y-axis is on the log scale.
Clearly, our PQFTL algorithm obtains low regret even for reasonably high privacy levels ($\epsilon = 0.01$). Furthermore,
the regret gets closer to the regret obtained by the non-private algorithm as the privacy requirements are made weaker.

5.2 Online Logistic Regression 

Online logistic regression is a variant of online linear regression where the cost function is the logistic loss rather
than the squared error. Logistic regression is a popular method to learn classifiers, and has been shown to be successful
for many practical problems. In this experiment, we apply our private IGD algorithm to the online logistic regression
problem. To this end, we use the standard Forest cover-type dataset, a dataset with two classes, 54-dimensional feature
vectors and 581,012 data points. We select 10% of the data points for testing and run our Private IGD algorithm on
the remaining data points. Figure 2 (c) shows the classification accuracy (averaged over 10 runs) obtained by IGD and
our PIGD algorithm for different privacy levels. Clearly, our algorithm is able to learn a reasonable classifier from
the dataset in a private manner. Note that our regret bound for the PIGD method is $O(\sqrt{T})$; hence, it would require more
data points to reduce regret to very small values, which is reflected by a drop in classification accuracy as $\epsilon$ decreases.



6 Conclusions 

In this paper, we considered the problem of differentially private online learning. We used online convex programming
(OCP) as the underlying online learning model and described a method to achieve sub-linear regret for the OCP
problem, while maintaining $(\epsilon, \delta)$-differential privacy of the data (input functions). Specifically, given an arbitrary
OCP algorithm, we showed how to produce a private version of the algorithm and proved the privacy guarantees by
bounding the sensitivity of the algorithm's output at each step $t$. We considered two well-known algorithms (IGD and
GIGA) in our framework and provided a private version of each of the algorithms. Both of our differentially private
algorithms have $O(\sqrt{T})$ regret while guaranteeing $(\epsilon, \delta)$-differential privacy. We also showed that for the special
case of quadratic cost functions, we can obtain logarithmic regret while providing differential privacy guarantees
on the input data. Finally, we showed that our differentially private online learning approach can be used to obtain
differentially private algorithms for a large class of convex offline learning problems as well. Our approach can handle
a larger class of offline problems and obtains better error bounds than the existing methods [3].

While we can provide logarithmic regret for the special class of quadratic functions, our regret for general strongly
convex functions is $O(\sqrt{T})$. An open question is whether the $O(\sqrt{T})$ bound that we obtain is optimal or whether it can be
further improved. Similarly, another important open question is to develop privacy preserving techniques for the OCP
problem that have a poly-logarithmic dependence on the dimension of the data. Finally, another interesting research
direction is an extension of our differentially private framework from the "full information" OCP setting to the bandit
setting.

Acknowledgments. We would like to thank Ankan Saha, Adam Smith, Piyush Srivastava and Ambuj Tewari for 
References

[1] Michael Barbaro and Tom Zeller. A face is exposed for AOL searcher no. 4417749. New York Times, 2006.

[2] Avrim Blum, Katrina Ligett, and Aaron Roth. A learning theory approach to non-interactive database privacy. 
In STOC, pages 609-618, 2008. 

[3] Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. Differentially private empirical risk minimization.
J. Mach. Learn. Res., 12:1069-1109, 2011.

[4] Anindya De. Lower bounds in differential privacy. CoRR, abs/1107.2183, 2011.

[5] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. In PODS, pages 202-210, 2003.

[6] Cynthia Dwork. Differential privacy. In ICALP, LNCS, pages 1-12, 2006.

[7] Cynthia Dwork. The differential privacy frontier (extended abstract). In TCC, pages 496-502, 2009. 

[8] Cynthia Dwork. Differential privacy in new settings. In SODA, pages 174-183, 2010. 

[9] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our data, ourselves:
Privacy via distributed noise generation. In EUROCRYPT, pages 486-503. Springer, 2006.

[10] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the 41st annual ACM 
symposium on Theory of computing, STOC '09, pages 371-380, New York, NY, USA, 2009. ACM. 

[11] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private 
data analysis. In TCC, pages 265-284, 2006. 

[12] Cynthia Dwork, Frank McSherry, and Kunal Talwar. The price of privacy and the limits of lp decoding. In 
Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, STOC '07, pages 85-94, New 
York, NY, USA, 2007. ACM. 

[13] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual
observation. In STOC, pages 715-724, 2010.

[14] Cynthia Dwork, Guy N. Rothblum, and Salil P. Vadhan. Boosting and differential privacy. In FOCS, pages 
51-60, 2010. 

[15] A. Frank and A. Asuncion. UCI machine learning repository, 2010. 






[16] Srivatsava Ranjit Ganta, Shiva Prasad Kasiviswanathan, and Adam Smith. Composition attacks and auxiliary 
information in data privacy. In KDD, pages 265-273, 2008. 

[17] Moritz Hardt, Katrina Ligett, and Frank McSherry. A simple and practical algorithm for differentially private 
data release. CoRR, abs/1012.4763, 2010. 

[18] Moritz Hardt and Guy N. Rothblum. A multiplicative weights mechanism for privacy-preserving data analysis.
In FOCS, pages 61-70, 2010.

[19] Moritz Hardt and Kunal Talwar. On the geometry of differential privacy. In Proceedings of the 42nd ACM 
symposium on Theory of computing, STOC '10, pages 705-714, New York, NY, USA, 2010. ACM. 

[20] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization.
Mach. Learn., 69:169-192, December 2007.

[21] Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization.
Machine Learning, 69:169-192, 2007. 10.1007/s10994-007-5016-8.

[22] Sham Kakade and Shai Shalev-Shwartz. Mind the duality gap: Logarithmic regret algorithms for online
optimization. In Neural Information Processing Systems, pages 1457-1464, 2008.

[23] Sham M. Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming 
algorithms. In Neural Information Processing Systems, pages 801-808, 2008. 

[24] Adam Kalai and Santosh Vempala. Efficient algorithms for universal portfolios. J. Mach. Learn. Res., 3:423- 
440, March 2003. 

[25] Shiva Prasad Kasiviswanathan, Mark Rudelson, Adam Smith, and Jonathan Ullman. The price of privately 
releasing contingency tables and the spectra of random matrices with correlated rows. In Proceedings of the 
42nd ACM symposium on Theory of computing, STOC '10, pages 775-784, New York, NY, USA, 2010. ACM. 

[26] Jyrki Kivinen and Manfred Warmuth. Exponentiated gradient versus gradient descent for linear predictors. 
Technical report, University of California at Santa Cruz, Santa Cruz, CA, USA, 1994. 

[27] Brian Kulis and Peter L. Bartlett. Implicit online learning. In ICML, pages 575-582, 2010. 

[28] Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, and Muthuramakrishnan Venkitasubramaniam.
l-diversity: Privacy beyond k-anonymity. In ICDE, page 24, 2006.

[29] Manas Pathak, Shantanu Rane, and Bhiksha Raj. Multiparty differential privacy via aggregation of locally trained
classifiers. In NIPS, 2010.

[30] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In Proceedings of the 48th 
Annual IEEE Symposium on Foundations of Computer Science, pages 94-103, Washington, DC, USA, 2007. 
IEEE Computer Society. 

[31] Arvind Narayanan and Vitaly Shmatikov. Robust de-anonymization of large sparse datasets. In Proceedings 
of the 2008 IEEE Symposium on Security and Privacy, pages 111-125, Washington, DC, USA, 2008. IEEE 
Computer Society. 

[32] Erik Ordentlich and Thomas M. Cover. On-line portfolio selection. In Proceedings of the ninth annual conference 
on Computational learning theory, COLT '96, pages 310-313, New York, NY, USA, 1996. ACM. 

[33] Benjamin I. P. Rubinstein, Peter L. Bartlett, Ling Huang, and Nina Taft. Learning in a large function space:
Privacy-preserving mechanisms for SVM learning. CoRR, abs/0911.5708, 2009.

[34] Latanya Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness
and Knowledge-based Systems, 2002.






[35] Oliver Williams and Frank McSherry. Probabilistic inference and differential privacy. In NIPS, 2010. 

[36] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 
928-936, 2003. 






